1. 13 9月, 2019 1 次提交
    • J
      io_uring: extend async work merging · 6d5d5ac5
      Jens Axboe 提交于
      We currently merge async work items if we see a strict sequential hit.
      This helps avoid unnecessary workqueue switches when we don't need
      them. We can extend this merging to cover cases where it's not a strict
      sequential hit, but the IO still fits within the same page. If an
      application is doing multiple requests within the same page, we don't
      want separate workers waiting on the same page to complete IO. It's much
      faster to let the first worker bring in the page, then operate on that
      page from the same worker to complete the next request(s).
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6d5d5ac5
  2. 10 9月, 2019 5 次提交
    • J
      io_uring: limit parallelism of buffered writes · 54a91f3b
      Jens Axboe 提交于
      All the popular filesystems need to grab the inode lock for buffered
      writes. With io_uring punting buffered writes to async context, we
      observe a lot of contention with all workers hamming this mutex.
      
      For buffered writes, we generally don't need a lot of parallelism on
      the submission side, as the flushing will take care of that for us.
      Hence we don't need a deep queue on the write side, as long as we
      can safely punt from the original submission context.
      
      Add a workqueue with a limit of 2 that we can use for buffered writes.
      This greatly improves the performance and efficiency of higher queue
      depth buffered async writes with io_uring.
      Reported-by: NAndres Freund <andres@anarazel.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      54a91f3b
    • J
      io_uring: add io_queue_async_work() helper · 18d9be1a
      Jens Axboe 提交于
      Add a helper for queueing a request for async execution, in preparation
      for optimizing it.
      
      No functional change in this patch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      18d9be1a
    • J
      io_uring: optimize submit_and_wait API · c5766668
      Jens Axboe 提交于
      For some applications that end up using a submit-and-wait type of
      approach for certain batches of IO, we can make that a bit more
      efficient by allowing the application to block for the last IO
      submission. This prevents an async when we don't need it, as the
      application will be blocking for the completion event(s) anyway.
      
      Typical use cases are using the liburing
      io_uring_submit_and_wait() API, or just using io_uring_enter()
      doing both submissions and completions. As a specific example,
      RocksDB doing MultiGet() is sped up quite a bit with this
      change.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c5766668
    • J
      io_uring: add support for link with drain · 4fe2c963
      Jackie Liu 提交于
      To support the link with drain, we need to do two parts.
      
      There is an sqes:
      
          0     1     2     3     4     5     6
       +-----+-----+-----+-----+-----+-----+-----+
       |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
       +-----+-----+-----+-----+-----+-----+-----+
      
      First, we need to ensure that the io before the link is completed,
      there is a easy way is set drain flag to the link list's head, so
      all subsequent io will be inserted into the defer_list.
      
      	+-----+
          (0) |  N  |
      	+-----+
                 |          (2)         (3)         (4)
      	+-----+     +-----+     +-----+     +-----+
          (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
      	+-----+     +-----+     +-----+     +-----+
                 |
      	+-----+
          (5) |  N  |
      	+-----+
                 |
      	+-----+
          (6) |  N  |
      	+-----+
      
      Second, ensure that the following IO will not be completed first,
      an easy way is to create a mirror of drain io and insert it into
      defer_list, in this way, as long as drain io is not processed, the
      following io in the defer_list will not be actively process.
      
      	+-----+
          (0) |  N  |
      	+-----+
                 |          (2)         (3)         (4)
      	+-----+     +-----+     +-----+     +-----+
          (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
      	+-----+     +-----+     +-----+     +-----+
                 |
      	+-----+
         ('3) |  D  |   <== This is a shadow of (3)
      	+-----+
                 |
      	+-----+
          (5) |  N  |
      	+-----+
                 |
      	+-----+
          (6) |  N  |
      	+-----+
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4fe2c963
    • J
      io_uring: fix wrong sequence setting logic · 8776f3fa
      Jackie Liu 提交于
      Sqo_thread will get sqring in batches, which will cause
      ctx->cached_sq_head to be added in batches. if one of these
      sqes is set with the DRAIN flag, then he will never get a
      chance to process, and finally sqo_thread will not exit.
      
      Fixes: de0617e4 ("io_uring: add support for marking commands as draining")
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8776f3fa
  3. 07 9月, 2019 1 次提交
    • J
      io_uring: expose single mmap capability · ac90f249
      Jens Axboe 提交于
      After commit 75b28aff we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ac90f249
  4. 28 8月, 2019 2 次提交
  5. 25 8月, 2019 1 次提交
  6. 23 8月, 2019 2 次提交
    • D
      xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT · 1fb254aa
      Darrick J. Wong 提交于
      Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
      fails on account of being out of disk quota.  I ran his reproducer
      script:
      
      # adduser dummy
      # adduser dummy plugdev
      
      # dd if=/dev/zero bs=1M count=100 of=test.img
      # mkfs.xfs test.img
      # mount -t xfs -o gquota test.img /mnt
      # mkdir -p /mnt/dummy
      # chown -c dummy /mnt/dummy
      # xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt
      
      (and then as user dummy)
      
      $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
      $ chgrp plugdev /mnt/dummy/foo
      
      and saw:
      
      ================================================
      WARNING: lock held when returning to user space!
      5.3.0-rc5 #rc5 Tainted: G        W
      ------------------------------------------------
      chgrp/47006 is leaving the kernel with locks still held!
      1 lock held by chgrp/47006:
       #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]
      
      ...which is clearly caused by xfs_setattr_nonsize failing to unlock the
      ILOCK after the xfs_qm_vop_chown_reserve call fails.  Add the missing
      unlock.
      
      Reported-by: benjamin.moody@gmail.com
      Fixes: 253f4911 ("xfs: better xfs_trans_alloc interface")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NSalvatore Bonaccorso <carnil@debian.org>
      1fb254aa
    • J
      io_uring: add need_resched() check in inner poll loop · 08f5439f
      Jens Axboe 提交于
      The outer poll loop checks for whether we need to reschedule, and
      returns to userspace if we do. However, it's possible to get stuck
      in the inner loop as well, if the CPU we are running on needs to
      reschedule to finish the IO work.
      
      Add the need_resched() check in the inner loop as well. This fixes
      a potential hang if the kernel is configured with
      CONFIG_PREEMPT_VOLUNTARY=y.
      Reported-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Tested-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      08f5439f
  7. 22 8月, 2019 11 次提交
    • L
      ubifs: Limit the number of pages in shrink_liability · 0af83abb
      Liu Song 提交于
      If the number of dirty pages to be written back is large,
      then writeback_inodes_sb will block waiting for a long time,
      causing hung task detection alarm. Therefore, we should limit
      the maximum number of pages written back this time, which let
      the budget be completed faster. The remaining dirty pages
      tend to rely on the writeback mechanism to complete the
      synchronization.
      
      Fixes: b6e51316 ("writeback: separate starting of sync vs opportunistic writeback")
      Signed-off-by: NLiu Song <liu.song11@zte.com.cn>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      0af83abb
    • R
      ubifs: Correctly initialize c->min_log_bytes · 377e208f
      Richard Weinberger 提交于
      Currently on a freshly mounted UBIFS, c->min_log_bytes is 0.
      This can lead to a log overrun and make commits fail.
      
      Recent kernels will report the following assert:
      UBIFS assert failed: c->lhead_lnum != c->ltail_lnum, in fs/ubifs/log.c:412
      
      c->min_log_bytes can have two states, 0 and c->leb_size.
      It controls how much bytes of the log area are reserved for non-bud
      nodes such as commit nodes.
      
      After a commit it has to be set to c->leb_size such that we have always
      enough space for a commit. While a commit runs it can be 0 to make the
      remaining bytes of the log available to writers.
      
      Having it set to 0 right after mount is wrong since no space for commits
      is reserved.
      
      Fixes: 1e51764a ("UBIFS: add new flash file system")
      Reported-and-tested-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      377e208f
    • R
      ubifs: Fix double unlock around orphan_delete() · 4dd75b33
      Richard Weinberger 提交于
      We unlock after orphan_delete(), so no need to unlock
      in the function too.
      Reported-by: NHan Xu <han.xu@nxp.com>
      Fixes: 8009ce95 ("ubifs: Don't leak orphans on memory during commit")
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      4dd75b33
    • Y
      afs: use correct afs_call_type in yfs_fs_store_opaque_acl2 · 7533be85
      YueHaibing 提交于
      It seems that 'yfs_RXYFSStoreOpaqueACL2' should be use in
      yfs_fs_store_opaque_acl2().
      
      Fixes: f5e45463 ("afs: Implement YFS ACL setting")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      7533be85
    • M
      afs: Fix possible oops in afs_lookup trace event · c4c613ff
      Marc Dionne 提交于
      The afs_lookup trace event can cause the following:
      
      [  216.576777] BUG: kernel NULL pointer dereference, address: 000000000000023b
      [  216.576803] #PF: supervisor read access in kernel mode
      [  216.576813] #PF: error_code(0x0000) - not-present page
      ...
      [  216.576913] RIP: 0010:trace_event_raw_event_afs_lookup+0x9e/0x1c0 [kafs]
      
      If the inode from afs_do_lookup() is an error other than ENOENT, or if it
      is ENOENT and afs_try_auto_mntpt() returns an error, the trace event will
      try to dereference the error pointer as a valid pointer.
      
      Use IS_ERR_OR_NULL to only pass a valid pointer for the trace, or NULL.
      
      Ideally the trace would include the error value, but for now just avoid
      the oops.
      
      Fixes: 80548b03 ("afs: Add more tracepoints")
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c4c613ff
    • D
      afs: Fix leak in afs_lookup_cell_rcu() · a5fb8e6c
      David Howells 提交于
      Fix a leak on the cell refcount in afs_lookup_cell_rcu() due to
      non-clearance of the default error in the case a NULL cell name is passed
      and the workstation default cell is used.
      
      Also put a bit at the end to make sure we don't leak a cell ref if we're
      going to be returning an error.
      
      This leak results in an assertion like the following when the kafs module is
      unloaded:
      
      	AFS: Assertion failed
      	2 == 1 is false
      	0x2 == 0x1 is false
      	------------[ cut here ]------------
      	kernel BUG at fs/afs/cell.c:770!
      	...
      	RIP: 0010:afs_manage_cells+0x220/0x42f [kafs]
      	...
      	 process_one_work+0x4c2/0x82c
      	 ? pool_mayday_timeout+0x1e1/0x1e1
      	 ? do_raw_spin_lock+0x134/0x175
      	 worker_thread+0x336/0x4a6
      	 ? rescuer_thread+0x4af/0x4af
      	 kthread+0x1de/0x1ee
      	 ? kthread_park+0xd4/0xd4
      	 ret_from_fork+0x24/0x30
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a5fb8e6c
    • J
      ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply · 28a28261
      Jeff Layton 提交于
      When ceph_mdsc_do_request returns an error, we can't assume that the
      filelock_reply pointer will be set. Only try to fetch fields out of
      the r_reply_info when it returns success.
      
      Cc: stable@vger.kernel.org
      Reported-by: NHector Martin <hector@marcansoft.com>
      Signed-off-by: NJeff Layton <jlayton@kernel.org>
      Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      28a28261
    • E
      ceph: clear page dirty before invalidate page · c95f1c5f
      Erqi Chen 提交于
      clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage().
      invalidatepage() clears page's private flag, if dirty flag is not
      cleared, the page may cause BUG_ON failure in ceph_set_page_dirty().
      
      Cc: stable@vger.kernel.org
      Link: https://tracker.ceph.com/issues/40862Signed-off-by: NErqi Chen <chenerqi@gmail.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c95f1c5f
    • L
      ceph: fix buffer free while holding i_ceph_lock in fill_inode() · af8a85a4
      Luis Henriques 提交于
      Calling ceph_buffer_put() in fill_inode() may result in freeing the
      i_xattrs.blob buffer while holding the i_ceph_lock.  This can be fixed by
      postponing the call until later, when the lock is released.
      
      The following backtrace was triggered by fstests generic/070.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
        6 locks held by kworker/0:4/3852:
         #0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
         #1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
         #2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
         #3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
         #4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
         #5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
        CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Workqueue: ceph-msgr ceph_con_workfn
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         fill_inode.isra.0+0xa9b/0xf70
         ceph_fill_trace+0x13b/0xc70
         ? dispatch+0x2eb/0x1476
         dispatch+0x320/0x1476
         ? __mutex_unlock_slowpath+0x4d/0x2a0
         ceph_con_workfn+0xc97/0x2ec0
         ? process_one_work+0x1b8/0x5f0
         process_one_work+0x244/0x5f0
         worker_thread+0x4d/0x3e0
         kthread+0x105/0x140
         ? process_one_work+0x5f0/0x5f0
         ? kthread_park+0x90/0x90
         ret_from_fork+0x3a/0x50
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      af8a85a4
    • L
      ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob() · 12fe3dda
      Luis Henriques 提交于
      Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
      freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  This can
      be fixed by having this function returning the old blob buffer and have
      the callers of this function freeing it when the lock is released.
      
      The following backtrace was triggered by fstests generic/117.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
        4 locks held by fsstress/649:
         #0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
         #1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
         #2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
         #3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
        CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         __ceph_build_xattrs_blob+0x12b/0x170
         __send_cap+0x302/0x540
         ? __lock_acquire+0x23c/0x1e40
         ? __mark_caps_flushing+0x15c/0x280
         ? _raw_spin_unlock+0x24/0x30
         ceph_check_caps+0x5f0/0xc60
         ceph_flush_dirty_caps+0x7c/0x150
         ? __ia32_sys_fdatasync+0x20/0x20
         ceph_sync_fs+0x5a/0x130
         iterate_supers+0x8f/0xf0
         ksys_sync+0x4f/0xb0
         __ia32_sys_sync+0xa/0x10
         do_syscall_64+0x50/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fc6409ab617
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      12fe3dda
    • L
      ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr() · 86968ef2
      Luis Henriques 提交于
      Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
      i_xattrs.prealloc_blob buffer while holding the i_ceph_lock.  This can be
      fixed by postponing the call until later, when the lock is released.
      
      The following backtrace was triggered by fstests generic/117.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
        3 locks held by fsstress/650:
         #0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
         #1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
         #2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
        CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         __ceph_setxattr+0x2b4/0x810
         __vfs_setxattr+0x66/0x80
         __vfs_setxattr_noperm+0x59/0xf0
         vfs_setxattr+0x81/0xa0
         setxattr+0x115/0x230
         ? filename_lookup+0xc9/0x140
         ? rcu_read_lock_sched_held+0x74/0x80
         ? rcu_sync_lockdep_assert+0x2e/0x60
         ? __sb_start_write+0x142/0x1a0
         ? mnt_want_write+0x20/0x50
         path_setxattr+0xba/0xd0
         __x64_sys_lsetxattr+0x24/0x30
         do_syscall_64+0x50/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7ff23514359a
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      86968ef2
  8. 21 8月, 2019 2 次提交
    • J
      io_uring: don't enter poll loop if we have CQEs pending · a3a0e43f
      Jens Axboe 提交于
      We need to check if we have CQEs pending before starting a poll loop,
      as those could be the events we will be spinning for (and hence we'll
      find none). This can happen if a CQE triggers an error, or if it is
      found by eg an IRQ before we get a chance to find it through polling.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3a0e43f
    • J
      io_uring: fix potential hang with polled IO · 500f9fba
      Jens Axboe 提交于
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.
      Reported-by: NJeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      500f9fba
  9. 20 8月, 2019 1 次提交
  10. 19 8月, 2019 2 次提交
    • E
      signal: Allow cifs and drbd to receive their terminating signals · 33da8e7c
      Eric W. Biederman 提交于
      My recent to change to only use force_sig for a synchronous events
      wound up breaking signal reception cifs and drbd.  I had overlooked
      the fact that by default kthreads start out with all signals set to
      SIG_IGN.  So a change I thought was safe turned out to have made it
      impossible for those kernel thread to catch their signals.
      
      Reverting the work on force_sig is a bad idea because what the code
      was doing was very much a misuse of force_sig.  As the way force_sig
      ultimately allowed the signal to happen was to change the signal
      handler to SIG_DFL.  Which after the first signal will allow userspace
      to send signals to these kernel threads.  At least for
      wake_ack_receiver in drbd that does not appear actively wrong.
      
      So correct this problem by adding allow_kernel_signal that will allow
      signals whose siginfo reports they were sent by the kernel through,
      but will not allow userspace generated signals, and update cifs and
      drbd to call allow_kernel_signal in an appropriate place so that their
      thread can receive this signal.
      
      Fixing things this way ensures that userspace won't be able to send
      signals and cause problems, that it is clear which signals the
      threads are expecting to receive, and it guarantees that nothing
      else in the system will be affected.
      
      This change was partly inspired by similar cifs and drbd patches that
      added allow_signal.
      Reported-by: Nronnie sahlberg <ronniesahlberg@gmail.com>
      Reported-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Tested-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Fixes: 247bc947 ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
      Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
      Fixes: fee10990 ("signal/drbd: Use send_sig not force_sig")
      Fixes: 3cf5d076 ("signal: Remove task parameter from force_sig")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      33da8e7c
    • D
      xfs: fix reflink source file racing with directio writes · 5d888b48
      Darrick J. Wong 提交于
      While trawling through the dedupe file comparison code trying to fix
      page deadlocking problems, Dave Chinner noticed that the reflink code
      only takes shared IOLOCK/MMAPLOCKs on the source file.  Because
      page_mkwrite and directio writes do not take the EXCL versions of those
      locks, this means that reflink can race with writer processes.
      
      For pure remapping this can lead to undefined behavior and file
      corruption; for dedupe this means that we cannot be sure that the
      contents are identical when we decide to go ahead with the remapping.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      5d888b48
  11. 17 8月, 2019 4 次提交
    • D
      vfs: fix page locking deadlocks when deduping files · edc58dd0
      Darrick J. Wong 提交于
      When dedupe wants to use the page cache to compare parts of two files
      for dedupe, we must be very careful to handle locking correctly.  The
      current code doesn't do this.  It must lock and unlock the page only
      once if the two pages are the same, since the overlapping range check
      doesn't catch this when blocksize < pagesize.  If the pages are distinct
      but from the same file, we must observe page locking order and lock them
      in order of increasing offset to avoid clashing with writeback locking.
      
      Fixes: 876bec6f ("vfs: refactor clone/dedupe_file_range common functions")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      edc58dd0
    • C
      xfs: compat_ioctl: use compat_ptr() · 4529e6d7
      Christoph Hellwig 提交于
      For 31-bit s390 user space, we have to pass pointer arguments through
      compat_ptr() in the compat_ioctl handler.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4529e6d7
    • C
      xfs: fall back to native ioctls for unhandled compat ones · 314e01a6
      Christoph Hellwig 提交于
      Always try the native ioctl if we don't have a compat handler.  This
      removes a lot of boilerplate code as 'modern' ioctls should generally
      be compat clean, and fixes the missing entries for the recently added
      FS_IOC_GETFSLABEL/FS_IOC_SETFSLABEL ioctls.
      
      Fixes: f7664b31 ("xfs: implement online get/set fs label")
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      314e01a6
    • H
      nfsd4: Fix kernel crash when reading proc file reply_cache_stats · 78e70e78
      He Zhe 提交于
      reply_cache_stats uses wrong parameter as seq file private structure and
      thus causes the following kernel crash when users read
      /proc/fs/nfsd/reply_cache_stats
      
      BUG: kernel NULL pointer dereference, address: 00000000000001f9
      PGD 0 P4D 0
      Oops: 0000 [#3] SMP PTI
      CPU: 6 PID: 1502 Comm: cat Tainted: G      D           5.3.0-rc3+ #1
      Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, BIOS BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
      RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
      Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
      RSP: 0018:ffffaa520106fe08 EFLAGS: 00010246
      RAX: 000000cfe1a77123 RBX: 0000000000000000 RCX: 0000000000291b46
      RDX: 000000cf00000000 RSI: 0000000000000006 RDI: 0000000000291b28
      RBP: ffffaa520106fe20 R08: 0000000000000006 R09: 000000cfe17e55dd
      R10: ffffa424e47c0000 R11: 000000000000030b R12: 0000000000000001
      R13: ffffa424e5697000 R14: 0000000000000001 R15: ffffa424e5697000
      FS:  00007f805735f580(0000) GS:ffffa424f8f80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000001f9 CR3: 00000000655ce005 CR4: 00000000003606e0
      Call Trace:
       seq_read+0x194/0x3e0
       __vfs_read+0x1b/0x40
       vfs_read+0x95/0x140
       ksys_read+0x61/0xe0
       __x64_sys_read+0x1a/0x20
       do_syscall_64+0x4d/0x120
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f805728b861
      Code: fe ff ff 50 48 8d 3d 86 b4 09 00 e8 79 e0 01 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 d9 19 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 48 83 ec 28 48 89 54
      RSP: 002b:00007ffea1ce3c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f805728b861
      RDX: 0000000000020000 RSI: 00007f8057183000 RDI: 0000000000000003
      RBP: 00007f8057183000 R08: 00007f8057182010 R09: 0000000000000000
      R10: 0000000000000022 R11: 0000000000000246 R12: 0000559a60e8ff10
      R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
      Modules linked in:
      CR2: 00000000000001f9
      ---[ end trace 01613595153f0cba ]---
      RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
      Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
      RSP: 0018:ffffaa52004b3e08 EFLAGS: 00010246
      RAX: 0000002bab45a7c6 RBX: 0000000000000000 RCX: 0000000000291b4c
      RDX: 0000002b00000000 RSI: 0000000000000004 RDI: 0000000000291b28
      RBP: ffffaa52004b3e20 R08: 0000000000000004 R09: 0000002bab1c8c7a
      R10: ffffa424e5500000 R11: 00000000000002a9 R12: 0000000000000001
      R13: ffffa424e4475000 R14: 0000000000000001 R15: ffffa424e4475000
      FS:  00007f805735f580(0000) GS:ffffa424f8f80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000001f9 CR3: 00000000655ce005 CR4: 00000000003606e0
      Killed
      
      Fixes: 3ba75830 ("nfsd4: drc containerization")
      Signed-off-by: NHe Zhe <zhe.he@windriver.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      78e70e78
  12. 16 8月, 2019 6 次提交
  13. 14 8月, 2019 1 次提交
  14. 13 8月, 2019 1 次提交
    • D
      xfs: don't crash on null attr fork xfs_bmapi_read · 8612de3f
      Darrick J. Wong 提交于
      Zorro Lang reported a crash in generic/475 if we try to inactivate a
      corrupt inode with a NULL attr fork (stack trace shortened somewhat):
      
      RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
      RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
      RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
      RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
      R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
      R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
      FS:  00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
      Call Trace:
       xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
       xfs_da_read_buf+0xf5/0x2c0 [xfs]
       xfs_da3_node_read+0x1d/0x230 [xfs]
       xfs_attr_inactive+0x3cc/0x5e0 [xfs]
       xfs_inactive+0x4c8/0x5b0 [xfs]
       xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
       destroy_inode+0xbc/0x190
       xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
       xfs_bulkstat_one+0x16/0x20 [xfs]
       xfs_bulkstat+0x6fa/0xf20 [xfs]
       xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
       xfs_file_ioctl+0xee0/0x12a0 [xfs]
       do_vfs_ioctl+0x193/0x1000
       ksys_ioctl+0x60/0x90
       __x64_sys_ioctl+0x6f/0xb0
       do_syscall_64+0x9f/0x4d0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f11d39a3e5b
      
      The "obvious" cause is that the attr ifork is null despite the inode
      claiming an attr fork having at least one extent, but it's not so
      obvious why we ended up with an inode in that state.
      Reported-by: NZorro Lang <zlang@redhat.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      8612de3f