1. 27 5月, 2020 11 次提交
  2. 14 5月, 2020 1 次提交
  3. 06 5月, 2020 1 次提交
  4. 30 4月, 2020 2 次提交
  5. 29 4月, 2020 6 次提交
    • A
      fix autofs regression caused by follow_managed() changes · 23b92780
      Al Viro 提交于
      fix #27211210
      
      commit 508c8772760d4ef9c1a044519b564710c3684fc5 upstream.
      
      we need to reload ->d_flags after the call of ->d_manage() - the thing
      might've been called with dentry still negative and have the damn thing
      turned positive while we'd waited.
      
      Fixes: d41efb522e90 "fs/namei.c: pull positivity check into follow_managed()"
      Reported-by: NIan Kent <raven@themaw.net>
      Tested-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      23b92780
    • A
      fs/namei.c: fix missing barriers when checking positivity · e2bdaed8
      Al Viro 提交于
      fix #27211210
      
      commit 2fa6b1e01a9b1a54769c394f06cd72c3d12a2d48 upstream.
      
      Pinned negative dentries can, generally, be made positive
      by another thread.  Conditions that prevent that are
      	* ->d_lock on dentry in question
      	* parent directory held at least shared
      	* nobody else could have observed the address of dentry
      Most of the places working with those fall into one of those
      categories; however, d_lookup() and friends need to be used
      with some care.  Fortunately, there's not a lot of call sites,
      and with few exceptions all of those fall under one of the
      cases above.
      
      Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
      and mountpoint_last().  Another one is lookup_slow() - there
      dcache lookup is done with parent held shared, but the result
      is used after we'd drop the lock.  The same happens in do_last() -
      the lookup (in lookup_one()) is done with parent locked, but
      result is used after unlocking.
      
      lookup_fast(), do_last() and mountpoint_last() flat-out reject
      negatives.
      
      Most of lookup_dcache() calls are made with parent locked at least
      shared; the only exception is lookup_one_len_unlocked().  It might
      return pinned negative, needs serious care from callers.  Fortunately,
      almost nobody calls it directly anymore; all but two callers have
      converted to lookup_positive_unlocked(), which rejects negatives.
      
      lookup_slow() is called by the same lookup_one_len_unlocked() (see
      above), mountpoint_last() and walk_component().  In those two negatives
      are rejected.
      
      In other words, there is a small set of places where we need to
      check carefully if a pinned potentially negative dentry is, in
      fact, positive.  After that check we want to be sure that both
      ->d_inode and type bits in ->d_flags are stable and observed.
      The set consists of follow_managed() (where the rejection happens
      for lookup_fast(), walk_component() and do_last()), last_mountpoint()
      and lookup_positive_unlocked().
      
      Solution:
      	1) transition from negative to positive (in __d_set_inode_and_type())
      stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
      	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
      smp_load_acquire() and bugger off if it type bits say "negative".
      That way anyone downstream of those checks has dentry know positive pinned,
      with ->d_inode and type bits of ->d_flags stable and observed.
      
      I considered splitting off d_lookup_positive(), so that the checks could
      be done right there, under ->d_lock.  However, that leads to massive
      duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
      worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
      No matter what, autofs_d_manage()/autofs_d_automount() must live with
      the possibility of pinned negative dentry passed their way, becoming
      positive under them - that's the intended behaviour when lookup comes
      in the middle of automount in progress, so we can't keep them out of
      the area that has to deal with those, more's the pity...
      Reported-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e2bdaed8
    • A
      fix dget_parent() fastpath race · 2e197f58
      Al Viro 提交于
      fix #27211210
      
      commit e84009336711d2bba885fc9cea66348ddfce3758 upstream.
      
      We are overoptimistic about taking the fast path there; seeing
      the same value in ->d_parent after having grabbed a reference
      to that parent does *not* mean that it has remained our parent
      all along.
      
      That wouldn't be a big deal (in the end it is our parent and
      we have grabbed the reference we are about to return), but...
      the situation with barriers is messed up.
      
      We might have hit the following sequence:
      
      d is a dentry of /tmp/a/b
      CPU1:					CPU2:
      parent = d->d_parent (i.e. dentry of /tmp/a)
      					rename /tmp/a/b to /tmp/b
      					rmdir /tmp/a, making its dentry negative
      grab reference to parent,
      end up with cached parent->d_inode (NULL)
      					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
      recheck d->d_parent, which is back to original
      decide that everything's fine and return the reference we'd got.
      
      The trouble is, caller (on CPU1) will observe dget_parent()
      returning an apparently negative dentry.  It actually is positive,
      but CPU1 has stale ->d_inode cached.
      
      Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
      NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
      we just go into the slow path in such case.  We don't wait for ->d_seq
      to become even either - again, if we are racing with renames, we
      can bloody well go to slow path anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      2e197f58
    • A
      new helper: lookup_positive_unlocked() · ec6880e8
      Al Viro 提交于
      fix #27211210
      
      commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream.
      
      Most of the callers of lookup_one_len_unlocked() treat negatives are
      ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
      that a pinned positive dentry remains positive - it's ->d_inode is
      stable, etc.; a pinned _negative_ dentry can become positive at any
      point as long as you are not holding its parent at least shared.
      So using lookup_one_len_unlocked() needs to be careful;
      lookup_positive_unlocked() is safer and that's what the callers
      end up open-coding anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ec6880e8
    • A
      fs/namei.c: pull positivity check into follow_managed() · da2c0773
      Al Viro 提交于
      fix #27211210
      
      commit d41efb522e902364ab09c782d511c1bedc388ddd upstream.
      
      There are 4 callers; two proceed to check if result is positive and
      fail with ENOENT if it isn't; one (in handle_lookup_down()) is
      guaranteed to yield positive and one (in lookup_fast()) is _preceded_
      by positivity check.
      
      However, follow_managed() on a negative dentry is a (fairly cheap)
      no-op on anything other than autofs.  And negative autofs dentries
      are never hashed, so lookup_fast() is not going to run into one
      of those.  Moreover, successful follow_managed() on a _positive_
      dentry never yields a negative one (and we significantly rely upon
      that in callers of lookup_fast()).
      
      In other words, we can easily transpose the positivity check and
      the call of follow_managed() in lookup_fast().  And that allows
      to fold the positivity check *into* follow_managed(), simplifying
      life for the code downstream of its calls.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      da2c0773
    • J
      ovl: inherit SB_NOSEC flag from upperdir · 8f93d2ca
      Jeffle Xu 提交于
      to #23113286
      
      Since the stacking of regular file operations [1], the overlayfs edition of
      write_iter() is called when writing regular files.
      
      Since then, xattr lookup is needed on every write since file_remove_privs()
      is called from ovl_write_iter(), which would become the performance
      bottleneck when writing small chunks of data. In my test case,
      file_remove_privs() would consume ~15% CPU when running fstime of unixbench
      (the workload is repeadly writing 1 KB to the same file) [2].
      
      Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be
      done only once on the first write. Unixbench fstime gets a ~20% performance
      gain with this patch.
      
      [1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/
      [2] https://www.spinics.net/lists/linux-unionfs/msg07153.htmlSigned-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git/commit/?h=overlayfs-next&id=b6dee44c57c785a59ef5f1f71588d13ebd89d395Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8f93d2ca
  6. 24 4月, 2020 2 次提交
  7. 02 4月, 2020 1 次提交
    • J
      io_uring: use current task creds instead of allocating a new one · 311b786d
      Jens Axboe 提交于
      fix #26374723
      
      commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a upstream.
      
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
      prepare_creds(). We don't actually need to create a copy of the creds
      as we're not modifying it, we just need a reference on the current task
      creds. This avoids the failure case as well, and propagates the const
      throughout the stack.
      
      Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      311b786d
  8. 26 3月, 2020 1 次提交
  9. 20 3月, 2020 2 次提交
    • A
      vfs: fix do_last() regression · 6073719d
      Al Viro 提交于
      commit 6404674acd596de41fd3ad5f267b4525494a891a upstream
      
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for that has
      slipped through the cracks and I went for dir->d_inode instead - made
      for more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to valid struct inode and it's the same
      as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
      should use that.
      Reported-by: N"Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6073719d
    • J
      io-wq: wait for io_wq_create() to setup necessary workers · 4c628e9d
      Jens Axboe 提交于
      commit b60fda6000a99a7ccac36005ab78b14b47c06de3 upstream
      
      We currently have a race where if setup is really slow, we can be
      calling io_wq_destroy() before we're done setting up. This will cause
      the caller to get stuck waiting for the manager to set things up, but
      the manager already exited.
      
      Fix this by doing a sync setup of the manager. This also fixes the case
      where if we failed creating workers, we'd also get stuck.
      
      In practice this race window was really small, as we already wait for
      the manager to start. Hence someone would have to call io_wq_destroy()
      after the task has started, but before it started the first loop. The
      reported test case forked tons of these, which is why it became an
      issue.
      
      Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
      Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      4c628e9d
  10. 19 3月, 2020 3 次提交
  11. 18 3月, 2020 10 次提交
    • A
      do_last(): fetch directory ->i_mode and ->i_uid before it's too late · 98ab6ba3
      Al Viro 提交于
      commit d0cb50185ae942b03c4327be322055d622dc79f6 upstream.
      
      [ Fixes: CVE-2020-8428 ]
      
      may_create_in_sticky() call is done when we already have dropped the
      reference to dir.
      
      Fixes: 30aba665 (namei: allow restricted O_CREAT of FIFOs and regular files)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      98ab6ba3
    • X
      io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · f1046eaf
      Xiaoguang Wang 提交于
      commit 32b2244a840a90ea94ba42392de5c48d53f521f5 upstream linux-next
      
      When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
      to do io completion events polling again, they can rely on io_sq_thread to do
      polling work, which can reduce cpu usage and uring_lock contention.
      
      I modify fio io_uring engine codes a bit to evaluate the performance:
      static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                              continue;
                      }
      
      -               if (!o->sqpoll_thread) {
      +               if (o->sqpoll_thread && o->hipri) {
                              r = io_uring_enter(ld, 0, actual_min,
                                                      IORING_ENTER_GETEVENTS);
                              if (r < 0) {
      
      and use "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
      -rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
      -size=10G -numjobs=1  -time_based -runtime=120"
      
      original codes
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
      fio cpu usage |     100% |     100% |     100% |     100% |     100%
      --------------------------------------------------------------------
      
      with patch
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
      fio cpu usage |    63.8% |   74.4%% |    81.1% |    83.7% |    82.4%
      --------------------------------------------------------------------
      bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
      --------------------------------------------------------------------
      
      From above test results, we can see that bw has above 5.5%~13%
      improvement, and fio process's cpu usage also drops much. Note this
      won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
      are both enabled, in this case, io_sq_thread always has 100% cpu usage.
      I think this patch will be friendly to applications which will often use
      io_uring_wait_cqe() or similar from liburing.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f1046eaf
    • J
      iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() · 96d39291
      Jan Kara 提交于
      commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream
      
      Notes from Xiaoguang Wang:
          Indeed this patch should be appled before "ext4: introduce direct I/O
      read using iomap infrastructure", but given that we have already appled
      "ext4: introduce direct I/O read using iomap infrastructure" previously,
      we need to update iomap_dio_rw() calls with the new argument  in ext4.
      
      Filesystems do not support doing IO as asynchronous in some cases. For
      example in case of unaligned writes or in case file size needs to be
      extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
      in such cases, add argument to iomap_dio_rw() which makes the function
      wait for IO completion. This also results in executing
      iomap_dio_complete() inline in iomap_dio_rw() providing its return value
      to the caller as for ordinary sync IO.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      96d39291
    • X
      io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · a3a1829e
      Xiaoguang Wang 提交于
      commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 upstream
      
      After making ext4 support iopoll method:
        let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
      we found fio can easily hang in fio_ioring_getevents() with below fio
      job:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
      There are two issues that results in this hang, one reason is that
      when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
      does not use io_uring_enter to get completed events, it relies on
      kernel io_sq_thread to poll for completed events.
      
      Another reason is that there is a race: when io_submit_sqes() in
      io_sq_thread() submits a batch of sqes, variable 'inflight' will
      record the number of submitted reqs, then io_sq_thread will poll for
      reqs which have been added to poll_list. But note, if some previous
      reqs have been punted to io worker, these reqs will won't be in
      poll_list timely. io_sq_thread() will only poll for a part of previous
      submitted reqs, and then find poll_list is empty, reset variable
      'inflight' to be zero. If app just waits these deferred reqs and does
      not wake up io_sq_thread again, then hang happens.
      
      For app that entirely relies on io_sq_thread to poll completed requests,
      let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
      element to poll_list, and when io_sq_thread prepares to sleep, check
      whether poll_list is empty again, if not empty, continue to poll.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3a1829e
    • X
      io_uring: fix __io_iopoll_check deadlock in io_sq_thread · a715bf8d
      Xiaoguang Wang 提交于
      commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.
      
      Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already events pending, we won't enter poll loop.
      In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
      been terminated and don't reap pending events which are already in cq
      ring, and there are some reqs in poll_list, io_sq_thread will enter
      __io_iopoll_check(), and find pending events, then return, this loop
      will never have a chance to exit.
      
      I have seen this issue in fio stress tests, to fix this issue, let
      io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
      and remove __io_iopoll_check().
      
      Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a715bf8d
    • X
      ext4: start to support iopoll method · 7f8fa198
      Xiaoguang Wang 提交于
      Since commit "b1b4705d54ab ext4: introduce direct I/O read using
      iomap infrastructure", we can easily make ext4 support iopoll
      method, just use iomap_dio_iopoll().
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      7f8fa198
    • R
      ext4: Move to shared i_rwsem even without dioread_nolock mount opt · bde24335
      Ritesh Harjani 提交于
      commit bc6385dab125d20870f0eb9ca9e589f43abb3f56 upstream.
      
      We were using shared locking only in case of dioread_nolock mount option in case
      of DIO overwrites. This mount condition is not needed anymore with current code,
      since:-
      
      1. No race between buffered writes & DIO overwrites. Since buffIO writes takes
      exclusive lock & DIO overwrites will take shared locking. Also DIO path will
      make sure to flush and wait for any dirty page cache data.
      
      2. No race between buffered reads & DIO overwrites, since there is no block
      allocation that is possible with DIO overwrites. So no stale data exposure
      should happen. Same is the case between DIO reads & DIO overwrites.
      
      3. Also other paths like truncate is protected, since we wait there for any DIO
      in flight to be over.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bde24335
    • R
      ext4: Start with shared i_rwsem in case of DIO instead of exclusive · 3444040a
      Ritesh Harjani 提交于
      commit aa9714d0e39788d0688474c9d5f6a9a36159599f upstream.
      
      Earlier there was no shared lock in DIO read path. But this patch
      (16c54688: ext4: Allow parallel DIO reads)
      simplified some of the locking mechanism while still allowing for parallel DIO
      reads by adding shared lock in inode DIO read path.
      
      But this created problem with mixed read/write workload. It is due to the fact
      that in DIO path, we first start with exclusive lock and only when we determine
      that it is a ovewrite IO, we downgrade the lock. This causes the problem, since
      we still have shared locking in DIO reads.
      
      So, this patch tries to fix this issue by starting with shared lock and then
      switching to exclusive lock only when required based on ext4_dio_write_checks().
      
      Other than that, it also simplifies below cases:-
      
      1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was
      abused in the sense that it was not really checking for AIO anywhere also it
      used to check for extending writes. So this API was renamed and simplified to
      ext4_unaligned_io() which actully only checks if the IO is really unaligned.
      
      Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial
      block and that will require serialization against other direct IOs in the same
      block. So we take a exclusive inode lock for any unaligned DIO. In case of AIO
      we also need to wait for any outstanding IOs to complete so that conversion from
      unwritten to written is completed before anyone try to map the overlapping block.
      Hence we take exclusive inode lock and also wait for inode_dio_wait() for
      unaligned DIO case. Please note since we are anyway taking an exclusive lock in
      unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.
      
      2. Added ext4_extending_io(). This checks if the IO is extending the file.
      
      3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
      only switch to exclusive lock if required. So in most cases with aligned,
      non-extending, dioread_nolock & overwrites, it tries to write with a shared
      lock. If not, then we restart the operation in ext4_dio_write_checks(), after
      acquiring exclusive lock.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3444040a
    • R
      ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT · f0afa64f
      Ritesh Harjani 提交于
      commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 upstream.
      
      Apparently our current rwsem code doesn't like doing the trylock, then
      lock for real scheme.  So change our dax read/write methods to just do the
      trylock for the RWF_NOWAIT case.
      This seems to fix AIM7 regression in some scalable filesystems upto ~25%
      in some cases. Claimed in commit 942491c9 ("xfs: fix AIM7 regression")
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org>
      Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f0afa64f
    • M
      ext4: introduce direct I/O write using iomap infrastructure · 683d3d57
      Matthew Bobrowski 提交于
      commit 378f32bab3714f04c4e0c3aee4129f6703805550 upstream.
      
      This patch introduces a new direct I/O write path which makes use of
      the iomap infrastructure.
      
      All direct I/O writes are now passed from the ->write_iter() callback
      through to the new direct I/O handler ext4_dio_write_iter(). This
      function is responsible for calling into the iomap infrastructure via
      iomap_dio_rw().
      
      Code snippets from the existing direct I/O write code within
      ext4_file_write_iter() such as, checking whether the I/O request is
      unaligned asynchronous I/O, or whether the write will result in an
      overwrite have effectively been moved out and into the new direct I/O
      ->write_iter() handler.
      The block mapping flags that are eventually passed down to
      ext4_map_blocks() from the *_get_block_*() suite of routines have been
      taken out and introduced within ext4_iomap_alloc().
      
      For inode extension cases, ext4_handle_inode_extension() is
      effectively the function responsible for performing such metadata
      updates. This is called after iomap_dio_rw() has returned so that we
      can safely determine whether we need to potentially truncate any
      allocated blocks that may have been prepared for this direct I/O
      write. We don't perform the inode extension, or truncate operations
      from the ->end_io() handler as we don't have the original I/O 'length'
      available there. The ->end_io() however is responsible fo converting
      allocated unwritten extents to written extents.
      
      In the instance of a short write, we fallback and complete the
      remainder of the I/O using buffered I/O via
      ext4_buffered_write_iter().
      
      The existing buffer_head direct I/O implementation has been removed as
      it's now redundant.
      
      [ Fix up ext4_dio_write_iter() per Jan's comments at
        https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ]
      Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      683d3d57