1. 07 Dec 2020, 1 commit
  2. 01 Dec 2020, 1 commit
  3. 26 Nov 2020, 1 commit
    • io_uring: fix files grab/cancel race · af604703
      Authored by Pavel Begunkov
      When one task is in io_uring_cancel_files() and another is doing
      io_prep_async_work(), a race may happen. That's because after accounting
      a request as inflight in the first call to io_grab_identity(), the
      request may still fail and go to io_identity_cow(), which might briefly
      keep a dangling work.identity pointer, among other problems.
      
      Grab files last, so io_prep_async_work() won't fail if it did get into
      ->inflight_list.
      
      Note: the bug shouldn't exist after making io_uring_cancel_files() not
      poke into other tasks' requests.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      af604703
  4. 24 Nov 2020, 2 commits
    • io_uring: fix ITER_BVEC check · 9c3a205c
      Authored by Pavel Begunkov
      iov_iter::type is a bitmask that also keeps the direction etc., so it
      shouldn't be compared directly against ITER_*. Use the proper helper.
      
      Fixes: ff6165b2 ("io_uring: retain iov_iter state over io_read/io_write calls")
      Reported-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.9
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9c3a205c
    • io_uring: fix shift-out-of-bounds when round up cq size · eb2667b3
      Authored by Joseph Qi
      Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():
      
      [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
      [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int'
      [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3
      [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 59.603673] Call Trace:
      [ 59.604286] dump_stack+0x107/0x163
      [ 59.605237] ubsan_epilogue+0xb/0x5a
      [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e
      [ 59.607335] ? lock_downgrade+0x6c0/0x6c0
      [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0
      [ 59.609166] io_uring_create.cold+0x99/0x149
      [ 59.610114] io_uring_setup+0xd6/0x140
      [ 59.610975] ? io_uring_create+0x2510/0x2510
      [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400
      [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80
      [ 59.614038] ? trace_hardirqs_on+0x5b/0x180
      [ 59.615056] do_syscall_64+0x2d/0x40
      [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 59.617007] RIP: 0033:0x7f2bb8a0b239
      
      This is caused by roundup_pow_of_two() when the input entries value is
      large enough, e.g. 2^32-1. For sq_entries, we check the bound first and
      allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we
      round up first, which may overflow and truncate the value to 0, which is
      not the expected behavior. So check the cq size first and then round up.
      
      Fixes: 88ec3211 ("io_uring: round-up cq size before comparing with rounded sq size")
      Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb2667b3
  5. 23 Nov 2020, 2 commits
    • afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      Authored by David Howells
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Authored by Yicong Yang
      The attr->set() callback receives a u64 value, but simple_strtoll() is
      used for the conversion.  This leads to an erroneous cast if the user
      inputs a negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string from the
      user to an unsigned value.  The former returns -EINVAL if it gets a
      negative value, but the latter can't handle that situation correctly.
      Make 'val' unsigned long long, the type kstrtoull() takes; this also
      eliminates the compile warning on non-64-bit architectures.
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
  6. 20 Nov 2020, 5 commits
    • ext4: fix bogus warning in ext4_update_dx_flag() · f902b216
      Authored by Jan Kara
      The idea of the warning in ext4_update_dx_flag() is that we should warn
      when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
      checksums enabled since after clearing the flag, checksums for internal
      htree nodes will become invalid. So there's no need to warn (or actually
      do anything) when EXT4_INODE_INDEX is not set.
      
      Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
      Fixes: 48a34311 ("ext4: fix checksum errors with indexed dirs")
      Reported-by: Eric Biggers <ebiggers@kernel.org>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      f902b216
    • jbd2: fix kernel-doc markups · 2bf31d94
      Authored by Mauro Carvalho Chehab
      Kernel-doc markup should use this format:
              identifier - description
      
      There should not be any type before the identifier, as otherwise
      the parser won't do the right thing.
      
      Also, some identifiers have different names in their prototypes
      and in the kernel-doc markup.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      2bf31d94
    • xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Authored by Darrick J. Wong
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      eb840907
    • ext4: drop fast_commit from /proc/mounts · 704c2317
      Authored by Theodore Ts'o
      The options in /proc/mounts must be valid mount options --- and
      fast_commit is not a mount option.  Otherwise, command sequences like
      this will fail:
      
          # mount /dev/vdc /vdc
          # mkdir -p /vdc/phoronix_test_suite /pts
          # mount --bind /vdc/phoronix_test_suite /pts
          # mount -o remount,nodioread_nolock /pts
          mount: /pts: mount point not mounted or bad option.
      
      And in the system logs, you'll find:
      
          EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value
      
      Fixes: 995a3ed6 ("ext4: add fast_commit feature and handling for extended mount options")
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      704c2317
    • xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Authored by Dave Chinner
      Jens has reported a situation where partial direct IOs can be issued
      and completed yet still return -EAGAIN. We don't want this to report
      a short IO as we want XFS to complete user DIO entirely or not at
      all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN out because we hit the allocate case trying
      to make the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappings under NOWAIT
      conditions, e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      883a790a
  7. 19 Nov 2020, 6 commits
  8. 18 Nov 2020, 4 commits
    • gfs2: Fix regression in freeze_go_sync · 20b32912
      Authored by Bob Peterson
      Patch 541656d3 ("gfs2: freeze should work on read-only mounts") changed
      the check for glock state in function freeze_go_sync() from "gl->gl_state
      == LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE".  That's wrong and it
      regressed gfs2's freeze/thaw mechanism because it caused only the freezing
      node (which requests the glock in EX) to queue freeze work.
      
      All nodes go through this go_sync code path during the freeze to drop their
      SHared hold on the freeze glock, allowing the freezing node to acquire it
      in EXclusive mode. But all the nodes must freeze access to the file system
      locally, so they ALL must queue freeze work. The freeze_work calls
      freeze_func, which makes a request to reacquire the freeze glock in SH,
      effectively blocking until the thaw from the EX holder. Once thawed, the
      freezing node drops its EX hold on the freeze glock, then the (blocked)
      freeze_func reacquires the freeze glock in SH again (on all nodes, including
      the freezer) so all nodes go back to a thawed state.
      
      This patch changes the check back to gl_state == LM_ST_SHARED like it was
      prior to 541656d3.
      
      Fixes: 541656d3 ("gfs2: freeze should work on read-only mounts")
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      20b32912
    • io_uring: order refnode recycling · e297822b
      Authored by Pavel Begunkov
      Don't recycle a refnode until we're done with all requests of the
      nodes ejected before it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e297822b
    • io_uring: get an active ref_node from files_data · 1e5d770b
      Authored by Pavel Begunkov
      An active ref_node can always be found in ctx->files_data; it's much
      safer to get it this way instead of poking into files_data->ref_list.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e5d770b
    • io_uring: don't double complete failed reissue request · c993df5a
      Authored by Jens Axboe
      Zorro reports that an xfstest test case is failing, and it turns out that
      for the reissue path we can potentially issue a double completion on the
      request for the failure path. There's an issue around the retry as well,
      but for now, at least just make sure that we handle the error path
      correctly.
      
      Cc: stable@vger.kernel.org
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c993df5a
  9. 15 Nov 2020, 3 commits
    • afs: Fix afs_write_end() when called with copied == 0 [ver #3] · 3ad216ee
      Authored by David Howells
      When afs_write_end() is called with copied == 0, it tries to set the
      dirty region, but there's no way to actually encode a 0-length region in
      the encoding in page->private.
      
      "0,0", for example, indicates a 1-byte region at offset 0.  The maths
      miscalculates this and sets it incorrectly.
      
      Fix it to just do nothing but unlock and put the page in this case.  We
      don't actually need to mark the page dirty as nothing presumably
      changed.
      
      Fixes: 65dd2d60 ("afs: Alter dirty range encoding in page->private")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad216ee
    • ocfs2: initialize ip_next_orphan · f5785283
      Authored by Wengang Wang
      Though the problem was found on an older 4.1.12 kernel, I think upstream
      has the same issue.
      
      On one node in the cluster, there is the following call trace:
      
         # cat /proc/21473/stack
         __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
         ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
         ocfs2_evict_inode+0x152/0x820 [ocfs2]
         evict+0xae/0x1a0
         iput+0x1c6/0x230
         ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
         ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
         ocfs2_dir_foreach+0x29/0x30 [ocfs2]
         ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
         ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
         process_one_work+0x169/0x4a0
         worker_thread+0x5b/0x560
         kthread+0xcb/0xf0
         ret_from_fork+0x61/0x90
      
      The above stack is not reasonable, the final iput shouldn't happen in
      ocfs2_orphan_filldir() function.  Looking at the code,
      
        2067         /* Skip inodes which are already added to recover list, since dio may
        2068          * happen concurrently with unlink/rename */
        2069         if (OCFS2_I(iter)->ip_next_orphan) {
        2070                 iput(iter);
        2071                 return 0;
        2072         }
        2073
      
      The logic thinks the inode is already in the recover list on seeing
      ip_next_orphan is non-NULL, so it skips this inode after dropping the
      reference that was incremented in ocfs2_iget().
      
      However, if the inode were already in the recover list, it would have
      another reference, and the iput() at line 2070 would not be the final
      iput (dropping the last reference).  So I don't think the inode is
      really in the recover list (no vmcore to confirm).
      
      Note that ocfs2_queue_orphans(), though not shown in the call trace, is
      holding the cluster lock on the orphan directory when looking up
      unlinked inodes.  The on-disk inode eviction could involve a lot of IO,
      which may take a long time to finish.  That means this node could hold
      the cluster lock for a very long time, which can cause lock requests
      (from other nodes) for the orphan directory to hang for a long time.
      
      Looking more closely at ip_next_orphan, I found it is not initialized
      when allocating a new ocfs2_inode_info structure.
      
      This causes the reflink operations from some nodes to hang for a very
      long time waiting for the cluster lock on the orphan directory.
      
      Fix: initialize ip_next_orphan to NULL.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5785283
    • io_uring: handle -EOPNOTSUPP on path resolution · 944d1444
      Authored by Jens Axboe
      Any attempt to do path resolution on /proc/self from an async worker will
      yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
      and without blocking, so retry it from there.
      
      Ideally io_uring would know this upfront and not have to go through the
      worker thread to find out, but that doesn't currently seem feasible.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      944d1444
  10. 14 Nov 2020, 1 commit
  11. 13 Nov 2020, 2 commits
    • gfs2: Fix case in which ail writes are done to jdata holes · 4e79e3f0
      Authored by Bob Peterson
      Patch b2a846db ("gfs2: Ignore journal log writes for jdata holes")
      tried (unsuccessfully) to fix a case in which writes were done to jdata
      blocks, the blocks are sent to the ail list, then a punch_hole or truncate
      operation caused the blocks to be freed. In other words, the ail items
      are for jdata holes. Before b2a846db, the jdata hole caused function
      gfs2_block_map to return -EIO, which was eventually interpreted as an
      IO error to the journal, and then withdraw.
      
      This patch changes function gfs2_get_block_noalloc, which is only used
      for jdata writes, so it returns -ENODATA rather than -EIO, and when
      -ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
      We can safely ignore it because gfs2_ail1_start_one is only called
      when the jdata pages have already been written and truncated, so the
      ail1 content no longer applies.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      4e79e3f0
    • Revert "gfs2: Ignore journal log writes for jdata holes" · d3039c06
      Authored by Bob Peterson
      This reverts commit b2a846db.
      
      That commit changed the behavior of function gfs2_block_map to return
      -ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
      false.  While that fixed the intended problem for jdata, it also broke
      other callers of gfs2_block_map such as some jdata block reads.  Before
      the patch, an encountered hole would be skipped and the buffer seen as
      unmapped by the caller.  The patch changed the behavior to return
      -ENODATA, which is interpreted as an error by the caller.
      
      The -ENODATA return code should be restricted to the specific case where
      jdata holes are encountered during ail1 writes.  That will be done in a
      later patch.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      d3039c06
  12. 12 11月, 2020 10 次提交
  13. 11 11月, 2020 2 次提交