  1. 15 Jul, 2014 · 1 commit
  2. 25 Jun, 2014 · 4 commits
    • aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() · 855ef0de
      Committed by Oleg Nesterov
      ioctx_add_table() is the writer, it does not need rcu_read_lock() to
      protect ->ioctx_table. It relies on mm->ioctx_lock, and the rcu locks just
      add confusion.
      
      And it doesn't need rcu_dereference() for the same reason: it must see
      any updates previously done under the same ->ioctx_lock. We could use
      rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
      the function is simple enough.
      
      The same for kill_ioctx(), although it does not update the pointer.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() · 4b70ac5f
      Committed by Oleg Nesterov
      On 04/30, Benjamin LaHaise wrote:
      >
      > > -		ctx->mmap_size = 0;
      > > -
      > > -		kill_ioctx(mm, ctx, NULL);
      > > +		if (ctx) {
      > > +			ctx->mmap_size = 0;
      > > +			kill_ioctx(mm, ctx, NULL);
      > > +		}
      >
      > Rather than indenting and moving the two lines changing mmap_size and the
      > kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
      > the number of lines changed and avoid excessive indentation.
      
      OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
      of course, I won't argue.
      
      The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
      we reset mmap_size only for kill_ioctx(). But feel free to remove this change.
      
      -------------------------------------------------------------------------------
      Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
      
      1. We can read ->ioctx_table only once and we do not need rcu_read_lock()
         or even rcu_dereference().
      
         This mm has no users, nobody else can play with ->ioctx_table. Otherwise
         the code is buggy anyway, if we need rcu_read_lock() in a loop because
         ->ioctx_table can be updated then kfree(table) is obviously wrong.
      
      2. Update the comment. "exit_mmap(mm) is coming" is a good reason to avoid
         munmap(), but another reason is that we simply can't do vm_munmap() unless
         current->mm == mm and this is not true in general, the caller is mmput().
      
      3. We do not really need to nullify mm->ioctx_table before return, probably
         the current code does this to catch the potential problems. But in this
         case RCU_INIT_POINTER(NULL) looks better.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 · edfbbf38
      Committed by Benjamin LaHaise
      A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
      by commit a31ad380.  The changes made to
      aio_read_events_ring() failed to correctly limit the index into
      ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of
      an arbitrary page with a copy_to_user() to copy the contents into userspace.
      This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
      Petr for disclosing this issue.
      
      This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@vger.kernel.org
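The disclosure above comes down to trusting an index derived from memory that userspace can write. A minimal userspace sketch of the idea, assuming a ring of `nr` event slots spread over a page array; `ring_model`, `EVENTS_PER_PAGE` and `page_index_for()` are hypothetical names for illustration and are not the kernel's code:

```c
#include <assert.h>
#include <stddef.h>

#define EVENTS_PER_PAGE 128u   /* hypothetical: events that fit in one ring page */

/* Hypothetical model of the ring bookkeeping; not the kernel's struct. */
struct ring_model {
    unsigned nr;        /* number of event slots in the ring */
    unsigned nr_pages;  /* number of entries in the page array */
};

/*
 * Map an untrusted head value (read from memory userspace can write)
 * to a page index. Reducing it modulo nr first keeps the result inside
 * the page array. Returns -1 if the ring description is inconsistent.
 */
static long page_index_for(const struct ring_model *r, unsigned untrusted_head)
{
    if (r->nr == 0)
        return -1;

    unsigned head = untrusted_head % r->nr;   /* force into [0, nr) */
    unsigned idx = head / EVENTS_PER_PAGE;

    if (idx >= r->nr_pages)                   /* defensive second check */
        return -1;
    return (long)idx;
}
```

With this shape, a caller can pass any value found in the ring header and still get either a valid page index or an error, never an out-of-range index.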
    • aio: fix aio request leak when events are reaped by userspace · f8567a38
      Committed by Benjamin LaHaise
      The aio cleanups and optimizations by kmo that were merged into the 3.10
      tree added a regression for userspace event reaping.  Specifically, the
      reference counts are not decremented if the event is reaped in userspace,
      leading to the application being unable to submit further aio requests.
      This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
      This issue was uncovered as part of CVE-2014-0206.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
  3. 07 May, 2014 · 1 commit
    • new methods: ->read_iter() and ->write_iter() · 293bc982
      Committed by Al Viro
      Beginning to introduce those.  Just the callers for now, and it's
      clumsier than it'll eventually become; once we finish converting
      aio_read and aio_write instances, the things will get nicer.
      
      For now, these guys are in parallel to ->aio_read() and ->aio_write();
      they take iocb and iov_iter, with everything in iov_iter already
      validated.  File offset is passed in iocb->ki_pos, iov/nr_segs -
      in iov_iter.
      
      Main concerns in that series are stack footprint and ability to
      split the damn thing cleanly.
      
      [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 01 May, 2014 · 1 commit
  5. 30 Apr, 2014 · 2 commits
  6. 23 Apr, 2014 · 1 commit
  7. 17 Apr, 2014 · 1 commit
    • aio: block io_destroy() until all context requests are completed · e02ba72a
      Committed by Anatol Pomozov
      io_destroy() deletes the aio context and all resources related to it. It makes
      sense that no IO operations connected to the context should be running after
      the context is destroyed. Once the io_context has been removed, there is no
      way to get the status of its requests or to call io_getevents().
      
      man page for io_destroy says that this function may block until
      all context's requests are completed. Before kernel 3.11 io_destroy()
      blocked indeed, but since aio refactoring in 3.11 it is not true anymore.
      
      Here is a pseudo-code that shows a testcase for a race condition discovered
      in 3.11:
      
        initialize io_context
        io_submit(read to buffer)
        io_destroy()
      
        // context is destroyed so we can free the resources
        free(buffers);
      
        // if the buffer is allocated by some other user he'll be surprised
        // to learn that the buffer is still being filled by an outstanding
        // operation from the destroyed io_context
      
      The fix is straightforward: add a completion struct and wait on it
      in io_destroy(); complete() should be called when the number of in-flight
      requests reaches zero.
      
      If two or more io_destroy() calls are made for the same context simultaneously,
      only the first one waits for IO completion; the behaviour of the other calls
      is undefined.
      
      Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
        do not see the race condition anymore.
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
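The "wait until the in-flight count reaches zero" rule described above can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: `ctx_model` and its helpers are hypothetical, and the `done` flag merely stands in for the kernel's completion object. It is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a context with a count of in-flight requests. */
struct ctx_model {
    unsigned in_flight;   /* requests submitted but not yet completed */
    bool dying;           /* set once destroy has been requested */
    bool done;            /* stands in for a completion being signalled */
};

static void submit(struct ctx_model *c)
{
    c->in_flight++;
}

/* Called when one request finishes; signals "done" on the last one. */
static void request_complete(struct ctx_model *c)
{
    assert(c->in_flight > 0);
    if (--c->in_flight == 0 && c->dying)
        c->done = true;
}

/*
 * Begin teardown. Returns true only if nothing is outstanding, i.e.
 * the caller may free its buffers right away; otherwise it has to
 * wait until request_complete() sets done.
 */
static bool destroy_begin(struct ctx_model *c)
{
    c->dying = true;
    if (c->in_flight == 0)
        c->done = true;
    return c->done;
}
```

In the test case from the commit message, `free(buffers)` would only be safe once `done` is set, which is exactly what a blocking io_destroy() guarantees to the caller.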
  8. 28 Mar, 2014 · 1 commit
    • aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration · fa8a53c3
      Committed by Benjamin LaHaise
      As reported by Tang Chen, Gu Zheng and Yasuaki Ishimatsu, the following issues
      exist in the aio ring page migration support.
      
      As a result, for example, we have the following problem:
      
                  thread 1                      |              thread 2
                                                |
      aio_migratepage()                         |
       |-> take ctx->completion_lock            |
       |-> migrate_page_copy(new, old)          |
       |   *NOW*, ctx->ring_pages[idx] == old   |
                                                |
                                                |    *NOW*, ctx->ring_pages[idx] == old
                                                |    aio_read_events_ring()
                                                |     |-> ring = kmap_atomic(ctx->ring_pages[0])
                                                |     |-> ring->head = head;          *HERE, write to the old ring page*
                                                |     |-> kunmap_atomic(ring);
                                                |
       |-> ctx->ring_pages[idx] = new           |
       |   *BUT NOW*, the content of            |
       |    ring_pages[idx] is old.             |
       |-> release ctx->completion_lock         |
      
      As above, the new ring page will not be updated.
      
      Fix this issue, as well as prevent races in aio_ring_setup() by holding
      the ring_lock mutex during kioctx setup and page migration.  This avoids
      the overhead of taking another spinlock in aio_read_events_ring() as Tang's
      and Gu's original fix did, pushing the overhead into the migration code.
      
      Note that to handle the nesting of ring_lock inside of mmap_sem, the
      migratepage operation uses mutex_trylock().  Page migration is not a 100%
      critical operation in this case, so the occasional failure can be
      tolerated.  This issue was reported by Sasha Levin.
      
      Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
      Instead, make page migration fully serialised by mapping->private_lock, and
      have aio_free_ring() simply disconnect the kioctx from the mapping by calling
      put_aio_ring_file() before touching ctx->ring_pages[].  This simplifies the
      error handling logic in aio_migratepage(), and should improve robustness.
      
      v4: always do mutex_unlock() in cases when kioctx setup fails.
      Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
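The trylock approach described above (skip the migration rather than sleep on a lock that nests badly) can be sketched in userspace. This is an illustration only: `ring_model` and `migrate_page_model()` are hypothetical, a C11 `atomic_flag` stands in for the ring_lock mutex, and returning `-EAGAIN` on contention is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model of a ring whose page slot can be replaced. */
struct ring_model {
    atomic_flag ring_lock;  /* stands in for the ring_lock mutex */
    int page;               /* stands in for ctx->ring_pages[idx] */
};

/*
 * Try to swap in a new page. If the lock is already held, give up
 * instead of waiting: migration is optional, so a failed attempt is
 * acceptable and can be retried by the caller later.
 */
static int migrate_page_model(struct ring_model *r, int new_page)
{
    if (atomic_flag_test_and_set(&r->ring_lock))
        return -EAGAIN;                 /* lock busy, do not block */

    r->page = new_page;                 /* update under the lock */
    atomic_flag_clear(&r->ring_lock);   /* release */
    return 0;
}
```

The cost of contention lands on the migration path, which can tolerate failure, while the reader path stays free of extra locking.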
  9. 23 Dec, 2013 · 1 commit
    • aio: clean up and fix aio_setup_ring page mapping · 3dc9acb6
      Committed by Linus Torvalds
      Since commit 36bc08cc ("fs/aio: Add support to aio ring pages
      migration") the aio ring setup code has used a special per-ring backing
      inode for the page allocations, rather than just using random anonymous
      pages.
      
      However, rather than remembering the pages as it allocated them, it
      would allocate the pages, insert them into the file mapping (dirty, so
      that they couldn't be free'd), and then forget about them.  And then to
      look them up again, it would mmap the mapping, and then use
      "get_user_pages()" to get back an array of the pages we just created.
      
      Now, not only is that incredibly inefficient, it also leaked all the
      pages if the mmap failed (which could happen due to excessive number of
      mappings, for example).
      
      So clean it all up, making it much more straightforward.  Also remove
      some left-overs of the previous (broken) mm_populate() usage that was
      removed in commit d6c355c7 ("aio: fix race in ring buffer page
      lookup introduced by page migration support") but left the pointless and
      now misleading MAP_POPULATE flag around.
      Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 22 Dec, 2013 · 2 commits
    • aio/migratepages: make aio migrate pages sane · 8e321fef
      Committed by Benjamin LaHaise
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
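The role of an extra_count parameter can be shown with a simplified reference-count check. This is a sketch under stated assumptions: `refcount_allows_migration()` is a hypothetical helper, and the accounting (one reference for the migration caller, one for the mapping) is deliberately simplified and ignores other reference holders.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a page's reference count matches what the migration
 * code expects. extra_count lets a subsystem declare references it
 * legitimately holds, so it does not need to drop and retake them
 * around the call.
 */
static bool refcount_allows_migration(int page_count, bool has_mapping,
                                      int extra_count)
{
    int expected = 1 + extra_count;   /* caller's reference + declared extras */

    if (has_mapping)
        expected += 1;                /* reference held by the mapping */

    return page_count == expected;
}
```

For example, a page with three references (caller, mapping, and one held by the ring) passes the check only when the ring's reference is declared through `extra_count`.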
    • aio: fix kioctx leak introduced by "aio: Fix a trinity splat" · 1881686f
      Committed by Benjamin LaHaise
      e34ecee2 reworked the percpu reference
      counting to correct a bug trinity found.  Unfortunately, the change led
      to kioctxes being leaked because there was no final reference count to
      put.  Add that reference count back in to fix things.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
  11. 06 Dec, 2013 · 1 commit
  12. 20 Nov, 2013 · 2 commits
  13. 13 Nov, 2013 · 1 commit
  14. 09 Nov, 2013 · 1 commit
  15. 11 Oct, 2013 · 1 commit
    • aio: Fix a trinity splat · e34ecee2
      Committed by Kent Overstreet
      aio kiocb refcounting was broken - it was relying on keeping track of
      the number of available ring buffer entries, which it needs to do
      anyways; then at shutdown time it'd wait for completions to be delivered
      until the # of available ring buffer entries equalled what it was
      initialized to.
      
      Problem with that is that the ring buffer is mapped writable into
      userspace, so userspace could futz with the head and tail pointers to
      cause the kernel to see extra completions, and cause free_ioctx() to
      return while there were still outstanding kiocbs. Which would be bad.
      
      Fix is just to directly refcount the kiocbs - which is more
      straightforward, and with the new percpu refcounting code doesn't cost
      us any cacheline bouncing which was the whole point of the original
      scheme.
      
      Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't
      subtracting from aio_nr if ioctx_add_table() failed.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  16. 27 Sep, 2013 · 1 commit
  17. 10 Sep, 2013 · 1 commit
    • aio: rcu_read_lock protection for new rcu_dereference calls · d9b2c871
      Committed by Artem Savkov
      Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch"
      (77d30b14 in linux-next.git) introduced a
      couple of new rcu_dereference calls which are not protected by rcu_read_lock
      and result in the following warnings during syscall fuzzing (trinity):
      
      [  471.646379] ===============================
      [  471.649727] [ INFO: suspicious RCU usage. ]
      [  471.653919] 3.11.0-next-20130906+ #496 Not tainted
      [  471.657792] -------------------------------
      [  471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
      [  471.665968]
      [  471.665968] other info that might help us debug this:
      [  471.665968]
      [  471.672141]
      [  471.672141] rcu_scheduler_active = 1, debug_locks = 1
      [  471.677549] 1 lock held by trinity-child0/3774:
      [  471.681675]  #0:  (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70
      [  471.688721]
      [  471.688721] stack backtrace:
      [  471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
      [  471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  471.703151]  00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
      [  471.709544]  c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac
      [  471.716251]  c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4
      [  471.722902] Call Trace:
      [  471.724859]  [<c18a814b>] dump_stack+0x4b/0x66
      [  471.728772]  [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100
      [  471.733716]  [<c119bc7c>] SyS_io_setup+0x89c/0xc70
      [  471.737806]  [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70
      [  471.741689]  [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0
      [  471.746080]  [<c18b1fcc>] syscall_call+0x7/0xb
      [  471.749723]  [<c1080000>] ? task_fork_fair+0x240/0x260
      Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
      Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  18. 09 Sep, 2013 · 1 commit
    • aio: fix race in ring buffer page lookup introduced by page migration support · d6c355c7
      Committed by Benjamin LaHaise
      Prior to the introduction of page migration support in "fs/aio: Add support
      to aio ring pages migration" / 36bc08cc,
      mapping of the ring buffer pages was done via get_user_pages() while
      retaining mmap_sem held for write.  This avoided possible races with userland
      racing an munmap() or mremap().  The page migration patch, however, switched
      to using mm_populate() to prime the page mapping.  mm_populate() cannot be
      called with mmap_sem held.
      
      Instead of dropping the mmap_sem, revert to the old behaviour and simply
      drop the use of mm_populate() since get_user_pages() will cause the pages to
      get mapped anyways.  Thanks to Al Viro for spotting this issue.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  19. 30 Aug, 2013 · 2 commits
  20. 08 Aug, 2013 · 1 commit
  21. 06 Aug, 2013 · 1 commit
    • aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" · da90382c
      Committed by Benjamin LaHaise
      In the patch "aio: convert the ioctx list to table lookup v3", incorrect
      handling in the ioctx_alloc() error path was introduced that led to an
      ioctx being added via ioctx_add_table() and then freed when the ioctx_alloc()
      call returned -EAGAIN due to hitting the aio_max_nr limit.  Fix this by
      only calling ioctx_add_table() as the last step in ioctx_alloc().
      
      Also, several unnecessary rcu_dereference() calls were added that led to
      RCU warnings where the system was already protected by a spin lock for
      accessing mm->ioctx_table.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  22. 31 Jul, 2013 · 3 commits
    • aio: be defensive to ensure request batching is non-zero instead of BUG_ON() · 6878ea72
      Committed by Benjamin LaHaise
      In the event that an overflow/underflow occurs while calculating req_batch,
      clamp the minimum at 1 request instead of doing a BUG_ON().
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
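Clamping instead of asserting can be sketched as follows. The batch formula here (events divided by four times the CPU count) is an assumption made for illustration, and `compute_req_batch()` is a hypothetical helper rather than the kernel function.

```c
#include <assert.h>

/*
 * Compute a per-CPU request batch size and never return zero.
 * A zero batch would make batched accounting meaningless, so the
 * result is clamped to 1 rather than treated as a fatal error.
 */
static unsigned compute_req_batch(unsigned nr_events, unsigned nr_cpus)
{
    unsigned batch;

    if (nr_cpus == 0)
        return 1;                       /* avoid dividing by zero */

    batch = (nr_events - 1) / (nr_cpus * 4);
    if (batch < 1)
        batch = 1;                      /* clamp the minimum */
    return batch;
}
```

A tiny ring on a many-CPU machine is the typical case where the raw quotient is zero and the clamp takes effect.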
    • aio: convert the ioctx list to table lookup v3 · db446a08
      Committed by Benjamin LaHaise
      On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
      > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
      > > When using a large number of threads performing AIO operations the
      > > IOCTX list may get a significant number of entries which will cause
      > > significant overhead. For example, when running this fio script:
      > >
      > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=512; thread; loops=100
      > >
      > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
      > > 30% CPU time spent by lookup_ioctx:
      > >
      > >  32.51%  [guest.kernel]  [g] lookup_ioctx
      > >   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
      > >   4.40%  [guest.kernel]  [g] lock_release
      > >   4.19%  [guest.kernel]  [g] sched_clock_local
      > >   3.86%  [guest.kernel]  [g] local_clock
      > >   3.68%  [guest.kernel]  [g] native_sched_clock
      > >   3.08%  [guest.kernel]  [g] sched_clock_cpu
      > >   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
      > >   2.60%  [guest.kernel]  [g] memcpy
      > >   2.33%  [guest.kernel]  [g] lock_acquired
      > >   2.25%  [guest.kernel]  [g] lock_acquire
      > >   1.84%  [guest.kernel]  [g] do_io_submit
      > >
      > > This patch converts the ioctx list to a radix tree. For a performance
      > > comparison the above FIO script was run on a 2-socket, 8-core
      > > machine. These are the results (average and %rsd of 10 runs) for the
      > > original list based implementation and for the radix tree based
      > > implementation:
      > >
      > > cores         1         2         4         8         16        32
      > > list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
      > > %rsd         0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
      > > radix       73651 ms  41748 ms  23028 ms  16766 ms  15232 ms   13787 ms
      > > %rsd         1.19%      0.98%     0.69%     1.13%    0.72%      0.75%
      > > % of radix
      > > relative    66.12%     65.59%    66.63%    72.31%   77.26%     83.66%
      > > to list
      > >
      > > To consider the impact of the patch on the typical case of having
      > > only one ctx per process the following FIO script was run:
      > >
      > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=1; thread; loops=100
      > >
      > > on the same system and the results are the following:
      > >
      > > list        58892 ms
      > > %rsd         0.91%
      > > radix       59404 ms
      > > %rsd         0.81%
      > > % of radix
      > > relative    100.87%
      > > to list
      >
      > So, I was just doing some benchmarking/profiling to get ready to send
      > out the aio patches I've got for 3.11 - and it looks like your patch is
      > causing a ~1.5% throughput regression in my testing :/
      ... <snip>
      
      I've got an alternate approach for fixing this wart in lookup_ioctx()...
      Instead of using an rbtree, just use the reserved id in the ring buffer
      header to index an array pointing to the ioctx.  It's not finished yet, and
      it needs to be tidied up, but is most of the way there.
      
      		-ben
      --
      "Thought is the essence of where you are now."
      --
      kmo> And, a rework of Ben's code, but this was entirely his idea
      kmo>		-Kent
      
      bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
      free memory.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
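The array-indexed lookup described above replaces a list walk with a bounds check plus one array access. A minimal sketch, assuming the id comes from a header userspace can modify and therefore must be validated; `ioctx_model`, `table_model` and `lookup_model()` are hypothetical names, and the fixed table size is for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context: the address userspace knows it by, plus its slot. */
struct ioctx_model {
    unsigned long user_id;
    unsigned id;
};

/* Hypothetical per-process table of contexts, indexed by id. */
struct table_model {
    unsigned nr;                     /* valid slots, at most 8 here */
    struct ioctx_model *slots[8];
};

/*
 * Constant-time lookup. The id is untrusted, so it is range-checked,
 * and the stored user_id is compared so that a forged id cannot
 * select somebody else's slot.
 */
static struct ioctx_model *lookup_model(const struct table_model *t,
                                        unsigned id, unsigned long user_id)
{
    struct ioctx_model *ctx;

    if (id >= t->nr)
        return NULL;

    ctx = t->slots[id];
    if (ctx && ctx->user_id == user_id)
        return ctx;
    return NULL;
}
```

The lookup cost no longer depends on how many contexts a process has, which is the overhead the quoted profile attributes to lookup_ioctx.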
    • aio: double aio_max_nr in calculations · 4cd81c3d
      Committed by Benjamin LaHaise
      With the changes to use percpu counters for aio event ring size calculation,
      existing increases to aio_max_nr are now insufficient to allow for the
      allocation of enough events.  Double the value used for aio_max_nr to account
      for the doubling introduced by the percpu slack.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
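The arithmetic behind the doubling can be written out directly. This is a sketch with a hypothetical helper, `scaled_nr_events()`, assuming the requested event count is doubled internally for percpu slack before it is compared with the limit.

```c
#include <assert.h>
#include <limits.h>

/*
 * Scale a requested event count by the percpu slack factor (2) and
 * compare it against a limit scaled by the same factor. Returns the
 * scaled count, or 0 if the request exceeds the limit.
 */
static unsigned long scaled_nr_events(unsigned long nr_events,
                                      unsigned long aio_max_nr)
{
    unsigned long scaled;

    /* Refuse inputs whose doubled value would overflow. */
    if (nr_events > ULONG_MAX / 2 || aio_max_nr > ULONG_MAX / 2)
        return 0;

    scaled = nr_events * 2;          /* ring is sized with 2x slack */
    if (scaled > aio_max_nr * 2)     /* so the limit is doubled too */
        return 0;
    return scaled;
}
```

Doubling both sides keeps the user-visible meaning of the limit unchanged: a request for exactly `aio_max_nr` events is still accepted.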
  23. 30 Jul, 2013 · 9 commits
    • aio: Kill ki_dtor · d29c445b
      Committed by Kent Overstreet
      sock_aio_dtor() is dead code - and stuff that does need to do cleanup
      can simply do it before calling aio_complete().
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: Kill ki_users · 57282d8f
      Committed by Kent Overstreet
      The kiocb refcount is only needed for cancellation - to ensure a kiocb
      isn't freed while a ki_cancel callback is running. But if we restrict
      ki_cancel callbacks to not block (which they currently don't), we can
      simply drop the refcount.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: Kill unneeded kiocb members · 8bc92afc
      Committed by Kent Overstreet
      The old aio retry infrastructure needed to save the various arguments
      to aio operations. But with the retry infrastructure gone, we can trim
      struct kiocb quite a bit.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: Kill aio_rw_vect_retry() · 73a7075e
      Committed by Kent Overstreet
      This code doesn't serve any purpose anymore, since the aio retry
      infrastructure has been removed.
      
      This change should be safe because aio_read/write are also used for
      synchronous IO, and called from do_sync_read()/do_sync_write() - and
      there's no looping done in the sync case (the read and write syscalls).
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: Don't use ctx->tail unnecessarily · 5ffac122
      Committed by Kent Overstreet
      aio_complete() (arguably) needs to keep its own trusted copy of the tail
      pointer, but io_getevents() doesn't have to use it - it's already using
      the head pointer from the ring buffer.
      
      So convert it to use the tail from the ring buffer so it touches fewer
      cachelines and doesn't contend with the cacheline aio_complete() needs.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: io_cancel() no longer returns the io_event · bec68faa
      Committed by Kent Overstreet
      Originally, io_cancel() was documented to return the io_event if
      cancellation succeeded - the io_event wouldn't be delivered via the ring
      buffer like it normally would.
      
      But this isn't what the implementation was actually doing; the only
      driver implementing cancellation, the usb gadget code, never returned an
      io_event in its cancel function. And aio_complete() was recently changed
      to no longer suppress event delivery if the kiocb had been cancelled.
      
      This gets rid of the unused io_event argument to kiocb_cancel() and
      kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
      kiocb->ki_cancel() returned success.
      
      Also tweak the refcounting in kiocb_cancel() to make more sense.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
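The return-value change described above can be summarised in one small function. This is a sketch, with `io_cancel_result()` as a hypothetical helper: a successful driver cancel is reported to the caller as `-EINPROGRESS`, because the event will still arrive through the ring buffer, while driver errors pass through unchanged.

```c
#include <assert.h>
#include <errno.h>

/*
 * Translate the result of a driver's cancel callback into the value
 * reported to the io_cancel() caller.
 *   0      -> -EINPROGRESS (cancel accepted, event delivered via ring)
 *   other  -> returned as is
 */
static int io_cancel_result(int ki_cancel_ret)
{
    if (ki_cancel_ret == 0)
        return -EINPROGRESS;
    return ki_cancel_ret;
}
```

A caller that sees `-EINPROGRESS` should therefore keep reaping events as usual rather than expect an io_event to be filled in by the cancel call.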
    • aio: percpu ioctx refcount · 723be6e3
      Committed by Kent Overstreet
      This just converts the ioctx refcount to the new generic dynamic percpu
      refcount code.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: percpu reqs_available · e1bdd5f2
      Committed by Kent Overstreet
      See the previous patch ("aio: reqs_active -> reqs_available") for why we
      want to do this - this basically implements a per cpu allocator for
      reqs_available that doesn't actually allocate anything.
      
      Note that we need to increase the size of the ringbuffer we allocate,
      since a single thread won't necessarily be able to use all the
      reqs_available slots - some (up to about half) might be on other per cpu
      lists, unavailable for the current thread.
      
      We size the ringbuffer based on the nr_events userspace passed to
      io_setup(), so this is a slight behaviour change - but nr_events wasn't
      being used as a hard limit before, it was being rounded up to the next
      page before so this doesn't change the actual semantics.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
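The "allocator that doesn't allocate anything" can be modelled as slot counters moved in batches between a shared pool and per-CPU caches. This is a single-threaded sketch under stated assumptions: `pool_model`, `get_slot()` and `put_slot()` are hypothetical, the CPU count is fixed at two, and the atomic operations a real implementation would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 2

/* Hypothetical model: free request slots split across caches. */
struct pool_model {
    unsigned global;                   /* shared pool of free slots */
    unsigned batch;                    /* slots moved per refill, must be > 0 */
    unsigned percpu[NR_CPUS_MODEL];    /* slots cached on each CPU */
};

/* Take one slot on the given CPU, refilling from the shared pool. */
static bool get_slot(struct pool_model *p, unsigned cpu)
{
    if (p->percpu[cpu] == 0) {
        if (p->global < p->batch)
            return false;              /* nothing left to refill from */
        p->global -= p->batch;
        p->percpu[cpu] += p->batch;
    }
    p->percpu[cpu]--;
    return true;
}

/* Return one slot; hand a batch back once the local cache is large. */
static void put_slot(struct pool_model *p, unsigned cpu)
{
    if (p->batch == 0)
        return;                        /* invalid configuration, ignore */
    p->percpu[cpu]++;
    while (p->percpu[cpu] >= p->batch * 2) {
        p->percpu[cpu] -= p->batch;
        p->global += p->batch;
    }
}
```

With 4 slots and a batch of 2, one CPU can fail to get a slot while the other still caches a free one. That stranding is why the commit message says the ring buffer has to be sized larger than the number of requests a single thread expects to use.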
    • aio: reqs_active -> reqs_available · 34e83fc6
      Committed by Kent Overstreet
      The number of outstanding kiocbs is one of the few shared things left that
      has to be touched for every kiocb - it'd be nice to make it percpu.
      
      We can make it per cpu by treating it like an allocation problem: we have
      a maximum number of kiocbs that can be outstanding (i.e.  slots) - then we
      just allocate and free slots, and we know how to write per cpu allocators.
      
      So as prep work for that, we convert reqs_active to reqs_available.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>