1. 03 5月, 2018 1 次提交
  2. 20 3月, 2018 1 次提交
  3. 15 3月, 2018 2 次提交
    • T
      fs/aio: Use RCU accessors for kioctx_table->table[] · d0264c01
      Tejun Heo 提交于
      While converting ioctx index from a list to a table, db446a08
      ("aio: convert the ioctx list to table lookup v3") missed tagging
      kioctx_table->table[] as an array of RCU pointers and using the
      appropriate RCU accessors.  This introduces a small window in the
      lookup path where init and access may race.
      
      Mark kioctx_table->table[] with __rcu and use the approriate RCU
      accessors when using the field.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJann Horn <jannh@google.com>
      Fixes: db446a08 ("aio: convert the ioctx list to table lookup v3")
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.12+
      d0264c01
    • T
      fs/aio: Add explicit RCU grace period when freeing kioctx · a6d7cff4
      Tejun Heo 提交于
      While fixing refcounting, e34ecee2 ("aio: Fix a trinity splat")
      incorrectly removed explicit RCU grace period before freeing kioctx.
      The intention seems to be depending on the internal RCU grace periods
      of percpu_ref; however, percpu_ref uses a different flavor of RCU,
      sched-RCU.  This can lead to kioctx being freed while RCU read
      protected dereferences are still in progress.
      
      Fix it by updating free_ioctx() to go through call_rcu() explicitly.
      
      v2: Comment added to explain double bouncing.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJann Horn <jannh@google.com>
      Fixes: e34ecee2 ("aio: Fix a trinity splat")
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.13+
      a6d7cff4
  4. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  5. 20 9月, 2017 1 次提交
  6. 09 9月, 2017 1 次提交
    • J
      mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse 提交于
      Introduce a new migration mode that allow to offload the copy to a device
      DMA engine.  This changes the workflow of migration and not all
      address_space migratepage callback can support this.
      
      This is intended to be use by migrate_vma() which itself is use for thing
      like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed.  I disables
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
      added comment in those to explain why (part of this patch).  The commit
      message is unclear it should say that any callback that wish to support
      this new mode need to be aware of the difference in the migration flow
      from other mode.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...) and for DMA to be effective you want to copy multiple
      pages in one DMA operations.  But in the problematic case you can not
      easily hold the extra lock accross multiple call to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage I did
      not wanted to complexify this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2916ecc0
  7. 08 9月, 2017 1 次提交
  8. 05 9月, 2017 1 次提交
  9. 28 6月, 2017 1 次提交
  10. 20 6月, 2017 2 次提交
  11. 02 3月, 2017 1 次提交
  12. 25 2月, 2017 1 次提交
  13. 20 2月, 2017 1 次提交
  14. 15 1月, 2017 1 次提交
    • S
      aio: fix lock dep warning · a12f1ae6
      Shaohua Li 提交于
      lockdep reports a warnning. file_start_write/file_end_write only
      acquire/release the lock for regular files. So checking the files in aio
      side too.
      
      [  453.532141] ------------[ cut here ]------------
      [  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
      [  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [  453.533011] Modules linked in:
      [  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
      [  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
      [  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
      [  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
      [  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
      [  453.533011] Call Trace:
      [  453.533011]  [<ffffffff8196cffb>] dump_stack+0x67/0x9c
      [  453.533011]  [<ffffffff81091ee1>] __warn+0x111/0x130
      [  453.533011]  [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
      [  453.533011]  [<ffffffff81091f00>] ? __warn+0x130/0x130
      [  453.533011]  [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
      [  453.533011]  [<ffffffff811205d4>] lock_release+0x434/0x670
      [  453.533011]  [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
      [  453.533011]  [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
      [  453.533011]  [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
      [  453.533011]  [<ffffffff813aa6c9>] aio_write+0x229/0x280
      [  453.533011]  [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
      [  453.533011]  [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
      [  453.533011]  [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
      [  453.533011]  [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
      [  453.533011]  [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
      [  453.533011]  [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
      [  453.533011]  [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
      [  453.533011]  [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
      [  453.533011]  [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
      [  453.533011]  [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  453.533011]  [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
      [  453.533011]  [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
      [  453.533011]  [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
      [  453.533011] ---[ end trace b2fbe664d1cc0082 ]---
      
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a12f1ae6
  15. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  16. 25 12月, 2016 1 次提交
  17. 23 12月, 2016 1 次提交
    • A
      move aio compat to fs/aio.c · c00d2c7e
      Al Viro 提交于
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  18. 05 12月, 2016 1 次提交
  19. 31 10月, 2016 4 次提交
  20. 28 9月, 2016 1 次提交
  21. 16 9月, 2016 1 次提交
    • J
      aio: mark AIO pseudo-fs noexec · 22f6b4d3
      Jann Horn 提交于
      This ensures that do_mmap() won't implicitly make AIO memory mappings
      executable if the READ_IMPLIES_EXEC personality flag is set.  Such
      behavior is problematic because the security_mmap_file LSM hook doesn't
      catch this case, potentially permitting an attacker to bypass a W^X
      policy enforced by SELinux.
      
      I have tested the patch on my machine.
      
      To test the behavior, compile and run this:
      
          #define _GNU_SOURCE
          #include <unistd.h>
          #include <sys/personality.h>
          #include <linux/aio_abi.h>
          #include <err.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/syscall.h>
      
          int main(void) {
              personality(READ_IMPLIES_EXEC);
              aio_context_t ctx = 0;
              if (syscall(__NR_io_setup, 1, &ctx))
                  err(1, "io_setup");
      
              char cmd[1000];
              sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
                  (int)getpid());
              system(cmd);
              return 0;
          }
      
      In the output, "rw-s" is good, "rwxs" is bad.
      Signed-off-by: NJann Horn <jann@thejh.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22f6b4d3
  22. 24 5月, 2016 1 次提交
  23. 04 4月, 2016 1 次提交
  24. 05 9月, 2015 1 次提交
  25. 16 4月, 2015 1 次提交
    • J
      aio: fix serial draining in exit_aio() · dc48e56d
      Jens Axboe 提交于
      exit_aio() currently serializes killing io contexts. Each context
      killing ends up having to do percpu_ref_kill(), which in turns has
      to wait for an RCU grace period. This can take a long time, depending
      on the number of contexts. And there's no point in doing them serially,
      when we could be waiting for all of them in one fell swoop.
      
      This patches makes my fio thread offload test case exit 0.2s instead
      of almost 6s.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc48e56d
  26. 12 4月, 2015 8 次提交
  27. 07 4月, 2015 2 次提交
    • A
      ioctx_alloc(): fix vma (and file) leak on failure · deeb8525
      Al Viro 提交于
      If we fail past the aio_setup_ring(), we need to destroy the
      mapping.  We don't need to care about anybody having found ctx,
      or added requests to it, since the last failure exit is exactly
      the failure to make ctx visible to lookups.
      
      Reproducer (based on one by Joe Mario <jmario@redhat.com>):
      
      void count(char *p)
      {
      	char s[80];
      	printf("%s: ", p);
      	fflush(stdout);
      	sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
      	system(s);
      }
      
      int main()
      {
      	io_context_t *ctx;
      	int created, limit, i, destroyed;
      	FILE *f;
      
      	count("before");
      	if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
      		perror("opening aio-max-nr");
      	else if (fscanf(f, "%d", &limit) != 1)
      		fprintf(stderr, "can't parse aio-max-nr\n");
      	else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
      		perror("allocating aio_context_t array");
      	else {
      		for (i = 0, created = 0; i < limit; i++) {
      			if (io_setup(1000, ctx + created) == 0)
      				created++;
      		}
      		for (i = 0, destroyed = 0; i < created; i++)
      			if (io_destroy(ctx[i]) == 0)
      				destroyed++;
      		printf("created %d, failed %d, destroyed %d\n",
      			created, limit - created, destroyed);
      		count("after");
      	}
      }
      Found-by: NJoe Mario <jmario@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      deeb8525
    • A
      fix mremap() vs. ioctx_kill() race · b2edffdd
      Al Viro 提交于
      teach ->mremap() method to return an error and have it fail for
      aio mappings in process of being killed
      
      Note that in case of ->mremap() failure we need to undo move_page_tables()
      we'd already done; we could call ->mremap() first, but then the failure of
      move_page_tables() would require undoing whatever _successful_ ->mremap()
      has done, which would be a lot more headache in general.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2edffdd