1. 21 8月, 2019 1 次提交
    • J
      io_uring: fix potential hang with polled IO · 500f9fba
      Jens Axboe 提交于
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.
      Reported-by: NJeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      500f9fba
  2. 16 8月, 2019 3 次提交
    • J
      io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list · a982eeb0
      Jackie Liu 提交于
      This patch may fix two issues:
      
      First, when IOSQE_IO_DRAIN set, the next IOs need to be inserted into
      defer list to delay execution, but link io will be actively scheduled to
      run by calling io_queue_sqe.
      
      Second, when multiple LINK_IOs are inserted together with defer_list,
      the LINK_IO is no longer keep order.
      
         |-------------|
         |   LINK_IO   |      ----> insert to defer_list  -----------
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   NORMAL_IO |      ----> insert to defer_list  ----------|
         |-------------|                                            |
                                                                    |
                                    queue_work at same time   <-----|
      
      Fixes: 9e645e11 ("io_uring: add support for sqe links")
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a982eeb0
    • J
      block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Jens Axboe 提交于
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thoroug solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b6620d7
    • A
      io_uring: fix manual setup of iov_iter for fixed buffers · 99c79f66
      Aleix Roca Nonell 提交于
      Commit bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed
      buffers") introduced an optimization to avoid using the slow
      iov_iter_advance by manually populating the iov_iter iterator in some
      cases.
      
      However, the computation of the iterator count field was erroneous: The
      first bvec was always accounted for an extent of page size even if the
      bvec length was smaller.
      
      In consequence, some I/O operations on fixed buffers were unable to
      operate on the full extent of the buffer, consistently skipping some
      bytes at the end of it.
      
      Fixes: bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
      Cc: stable@vger.kernel.org
      Signed-off-by: NAleix Roca Nonell <aleix.rocanonell@bsc.es>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      99c79f66
  3. 14 8月, 2019 1 次提交
  4. 13 8月, 2019 2 次提交
    • D
      xfs: don't crash on null attr fork xfs_bmapi_read · 8612de3f
      Darrick J. Wong 提交于
      Zorro Lang reported a crash in generic/475 if we try to inactivate a
      corrupt inode with a NULL attr fork (stack trace shortened somewhat):
      
      RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
      RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
      RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
      RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
      R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
      R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
      FS:  00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
      Call Trace:
       xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
       xfs_da_read_buf+0xf5/0x2c0 [xfs]
       xfs_da3_node_read+0x1d/0x230 [xfs]
       xfs_attr_inactive+0x3cc/0x5e0 [xfs]
       xfs_inactive+0x4c8/0x5b0 [xfs]
       xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
       destroy_inode+0xbc/0x190
       xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
       xfs_bulkstat_one+0x16/0x20 [xfs]
       xfs_bulkstat+0x6fa/0xf20 [xfs]
       xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
       xfs_file_ioctl+0xee0/0x12a0 [xfs]
       do_vfs_ioctl+0x193/0x1000
       ksys_ioctl+0x60/0x90
       __x64_sys_ioctl+0x6f/0xb0
       do_syscall_64+0x9f/0x4d0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f11d39a3e5b
      
      The "obvious" cause is that the attr ifork is null despite the inode
      claiming an attr fork having at least one extent, but it's not so
      obvious why we ended up with an inode in that state.
      Reported-by: NZorro Lang <zlang@redhat.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      8612de3f
    • D
      xfs: remove more ondisk directory corruption asserts · 858b44dc
      Darrick J. Wong 提交于
      Continue our game of replacing ASSERTs for corrupt ondisk metadata with
      EFSCORRUPTED returns.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      858b44dc
  5. 09 8月, 2019 1 次提交
    • A
      gfs2: gfs2_walk_metadata fix · a27a0c9b
      Andreas Gruenbacher 提交于
      It turns out that the current version of gfs2_metadata_walker suffers
      from multiple problems that can cause gfs2_hole_size to report an
      incorrect size.  This will confuse fiemap as well as lseek with the
      SEEK_DATA flag.
      
      Fix that by changing gfs2_hole_walker to compute the metapath to the
      first data block after the hole (if any), and compute the hole size
      based on that.
      
      Fixes xfstest generic/490.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: NBob Peterson <rpeterso@redhat.com>
      Cc: stable@vger.kernel.org # v4.18+
      a27a0c9b
  6. 08 8月, 2019 4 次提交
    • J
      bdev: Fixup error handling in blkdev_get() · e91455ba
      Jan Kara 提交于
      Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
      LOOP_SET_FD") converted blkdev_get() to use the new helpers for
      finishing claiming of a block device. However the conversion botched the
      error handling in blkdev_get() and thus the bdev has been marked as held
      even in case __blkdev_get() returned error. This led to occasional
      warnings with block/001 test from blktests like:
      
      kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
      
      Correct the error handling.
      
      CC: stable@vger.kernel.org
      Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e91455ba
    • G
      Revert "kernfs: fix memleak in kernel_ops_readdir()" · 8097c43b
      Greg Kroah-Hartman 提交于
      This reverts commit cc798c83.
      
      Tony writes:
      	Somehow this causes a regression in Linux next for me where I'm
      	seeing lots of sysfs entries now missing under
      	/sys/bus/platform/devices.
      
      	For example, I now only see one .serial entry show up in sysfs.
      	Things work again if I revert commit cc798c83 ("kernfs: fix
      	memleak inkernel_ops_readdir()"). Any ideas why that would be?
      
      Tejun says:
      	Ugh, you're right.  It can get double-put cuz ctx->pos is put by
      	release too.
      
      So reverting it for now.
      Reported-by: NTony Lindgren <tony@atomide.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Fixes: cc798c83 ("kernfs: fix memleak in kernel_ops_readdir()")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8097c43b
    • J
      block: fix O_DIRECT error handling for bio fragments · e15c2ffa
      Jens Axboe 提交于
      0eb6ddfb tried to fix this up, but introduced a use-after-free
      of dio. Additionally, we still had an issue with error handling,
      as reported by Darrick:
      
      "I noticed a regression in xfs/747 (an unreleased xfstest for the
      xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
      to a simpler reproducer:
      
      error-test: 0 209 linear 8:48 0
      error-test: 209 1 error
      error-test: 210 6446894 linear 8:48 210
      
      Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
      for sector 209 and to pass the io to the scsi device everywhere else.
      
      On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
      (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
      EIO like you'd expect:
      
      pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
      pread: Input/output error
      +++ exited with 0 +++
      
      But doing it with a larger buffer succeeds(!):
      
      pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
      read 1146880/1146880 bytes at offset 0
      1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
      +++ exited with 0 +++
      
      (Note that the part of the buffer corresponding to the dm-error area is
      uninitialized)
      
      On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
      only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
      __blkdev_direct_IO() for bio fragments").
      
      AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
      split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
      and a second one for the 96k after that."
      
      Fix this by noting that it's always safe to dereference dio if we get
      BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
      we can safely increment the dio size before calling submit_bio(), and
      then decrement it on failure (not that it really matters, as the bio
      and dio are going away).
      
      For error handling, return to the original method of just using 'ret'
      for tracking the error, and the size tracking in dio->size.
      
      Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e15c2ffa
    • T
      NFSv4: Ensure state recovery handles ETIMEDOUT correctly · 67e7b52d
      Trond Myklebust 提交于
      Ensure that the state recovery code handles ETIMEDOUT correctly,
      and also that we set RPC_TASK_TIMEOUT when recovering open state.
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      67e7b52d
  7. 07 8月, 2019 2 次提交
    • Q
      btrfs: trim: Check the range passed into to prevent overflow · 07301df7
      Qu Wenruo 提交于
      Normally the range->len is set to default value (U64_MAX), but when it's
      not default value, we should check if the range overflows.
      
      And if it overflows, return -EINVAL before doing anything.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      07301df7
    • F
      Btrfs: fix sysfs warning and missing raid sysfs directories · d7cd4dd9
      Filipe Manana 提交于
      In the 5.3 merge window, commit 7c7e3014 ("btrfs: sysfs: Replace
      default_attrs in ktypes with groups"), we started using the member
      "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
      to a series of warnings when running some test cases of fstests, such
      as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
      warnings are like the following:
      
        [116648.059212] kernfs: can not remove 'total_bytes', no directory
        [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
        [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
        [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.077143] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.077852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.080585] Call Trace:
        [116648.081262]  remove_files+0x31/0x70
        [116648.081929]  sysfs_remove_group+0x38/0x80
        [116648.082596]  sysfs_remove_groups+0x34/0x70
        [116648.083258]  kobject_del+0x20/0x60
        [116648.083933]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.084608]  close_ctree+0x19a/0x380 [btrfs]
        [116648.085278]  generic_shutdown_super+0x6c/0x110
        [116648.085951]  kill_anon_super+0xe/0x30
        [116648.086621]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.087289]  deactivate_locked_super+0x3a/0x70
        [116648.087956]  cleanup_mnt+0xb4/0x160
        [116648.088620]  task_work_run+0x7e/0xc0
        [116648.089285]  exit_to_usermode_loop+0xfa/0x100
        [116648.089933]  do_syscall_64+0x1cb/0x220
        [116648.090567]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.091197] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.100046] ---[ end trace 22e24db328ccadf8 ]---
        [116648.100618] ------------[ cut here ]------------
        [116648.101175] kernfs: can not remove 'used_bytes', no directory
        [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
        [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
        [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.113268] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.113939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.116649] Call Trace:
        [116648.117326]  remove_files+0x31/0x70
        [116648.117997]  sysfs_remove_group+0x38/0x80
        [116648.118671]  sysfs_remove_groups+0x34/0x70
        [116648.119342]  kobject_del+0x20/0x60
        [116648.120022]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.120707]  close_ctree+0x19a/0x380 [btrfs]
        [116648.121396]  generic_shutdown_super+0x6c/0x110
        [116648.122057]  kill_anon_super+0xe/0x30
        [116648.122702]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.123335]  deactivate_locked_super+0x3a/0x70
        [116648.123961]  cleanup_mnt+0xb4/0x160
        [116648.124586]  task_work_run+0x7e/0xc0
        [116648.125210]  exit_to_usermode_loop+0xfa/0x100
        [116648.125830]  do_syscall_64+0x1cb/0x220
        [116648.126463]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.127080] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.135923] ---[ end trace 22e24db328ccadf9 ]---
      
      These happen because, during the unmount path, we call kobject_del() for
      raid kobjects that are not fully initialized, meaning that we set their
      ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
      their parent kobject, which is done through btrfs_add_raid_kobjects().
      
      We have this split raid kobject setup since commit 75cb379d
      ("btrfs: defer adding raid type kobject until after chunk relocation") in
      order to avoid triggering reclaim during contextes where we can not
      (either we are holding a transaction handle or some lock required by
      the transaction commit path), so that we do the calls to kobject_add(),
      which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
      in contextes where it is safe to trigger reclaim. That change expected
      that a new raid kobject can only be created either when mounting the
      filesystem or after raid profile conversion through the relocation path.
      However, we can have new raid kobject created in other two cases at least:
      
      1) During device replace (or scrub) after adding a device a to the
         filesystem. The replace procedure (and scrub) do calls to
         btrfs_inc_block_group_ro() which can allocate a new block group
         with a new raid profile (because we now have more devices). This
         can be triggered by test cases btrfs/027 and btrfs/176.
      
      2) During a degraded mount trough any write path. This can be triggered
         by test case btrfs/124.
      
      Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
      makes things more complex and fragile, can also introduce deadlocks with
      reclaim the following way:
      
      1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
         anywhere in the replace/scrub path will cause a deadlock with reclaim
         because if reclaim happens and a transaction commit is triggered,
         the transaction commit path will block at btrfs_scrub_pause().
      
      2) During degraded mounts it is essentially impossible to figure out where
         to add extra calls to btrfs_add_raid_kobjects(), because allocation of
         a block group with a new raid profile can happen anywhere, which means
         we can't safely figure out which contextes are safe for reclaim, as
         we can either hold a transaction handle or some lock needed by the
         transaction commit path.
      
      So it is too complex and error prone to have this split setup of raid
      kobjects. So fix the issue by consolidating the setup of the kobjects in a
      single place, at link_block_group(), and setup a nofs context there in
      order to prevent reclaim being triggered by the memory allocations done
      through the call chain of kobject_add().
      
      Besides fixing the sysfs warnings during kobject_del(), this also ensures
      the sysfs directories for the new raid profiles end up created and visible
      to users (a bug that existed before the 5.3 commit 7c7e3014
      ("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
      
      Fixes: 75cb379d ("btrfs: defer adding raid type kobject until after chunk relocation")
      Fixes: 7c7e3014 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7cd4dd9
  8. 06 8月, 2019 6 次提交
  9. 05 8月, 2019 12 次提交
  10. 04 8月, 2019 1 次提交
    • T
      fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve(). · 294fc7a4
      Tetsuo Handa 提交于
      When the system is close-to-OOM, fsync() may fail due to -ENOMEM because
      xfs_log_reserve() is using KM_MAYFAIL. It is a bad thing to fail writeback
      operation due to user-triggerable OOM condition. Since we are not using
      KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(), let's
      use the same flags at xfs_log_reserve().
      
        oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
        CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
        Call Trace:
         dump_stack+0x67/0x95
         warn_alloc+0xa9/0x140
         __alloc_pages_slowpath+0x9a8/0xbce
         __alloc_pages_nodemask+0x372/0x3b0
         alloc_slab_page+0x3a/0x8d0
         new_slab+0x330/0x420
         ___slab_alloc.constprop.94+0x879/0xb00
         __slab_alloc.isra.89.constprop.93+0x43/0x6f
         kmem_cache_alloc+0x331/0x390
         kmem_zone_alloc+0x9f/0x110 [xfs]
         kmem_zone_alloc+0x9f/0x110 [xfs]
         xlog_ticket_alloc+0x33/0xd0 [xfs]
         xfs_log_reserve+0xb4/0x410 [xfs]
         xfs_trans_reserve+0x1d1/0x2b0 [xfs]
         xfs_trans_alloc+0xc9/0x250 [xfs]
         xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
         xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
         xfs_vm_writepages+0x76/0xa0 [xfs]
         do_writepages+0x17/0x80
         __filemap_fdatawrite_range+0xc1/0xf0
         file_write_and_wait_range+0x53/0xa0
         xfs_file_fsync+0x87/0x290 [xfs]
         vfs_fsync_range+0x37/0x80
         do_fsync+0x38/0x60
         __x64_sys_fsync+0xf/0x20
         do_syscall_64+0x4a/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: eb01c9cd ("[XFS] Remove the xlog_ticket allocator")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      294fc7a4
  11. 03 8月, 2019 2 次提交
    • P
      coredump: split pipe command whitespace before expanding template · 315c6926
      Paul Wise 提交于
      Save the offsets of the start of each argument to avoid having to update
      pointers to each argument after every corename krealloc and to avoid
      having to duplicate the memory for the dump command.
      
      Executable names containing spaces were previously being expanded from
      %e or %E and then split in the middle of the filename.  This is
      incorrect behaviour since an argument list can represent arguments with
      spaces.
      
      The splitting could lead to extra arguments being passed to the core
      dump handler that it might have interpreted as options or ignored
      completely.
      
      Core dump handlers that are not aware of this Linux kernel issue will be
      using %e or %E without considering that it may be split and so they will
      be vulnerable to processes with spaces in their names breaking their
      argument list.  If their internals are otherwise well written, such as
      if they are written in shell but quote arguments, they will work better
      after this change than before.  If they are not well written, then there
      is a slight chance of breakage depending on the details of the code but
      they will already be fairly broken by the split filenames.
      
      Core dump handlers that are aware of this Linux kernel issue will be
      placing %e or %E as the last item in their core_pattern and then
      aggregating all of the remaining arguments into one, separated by
      spaces.  Alternatively they will be obtaining the filename via other
      methods.  Both of these will be compatible with the new arrangement.
      
      A side effect from this change is that unknown template types (for
      example %z) result in an empty argument to the dump handler instead of
      the argument being dropped.  This is a desired change as:
      
      It is easier for dump handlers to process empty arguments than dropped
      ones, especially if they are written in shell or don't pass each
      template item with a preceding command-line option in order to
      differentiate between individual template types.  Most core_patterns in
      the wild do not use options so they can confuse different template types
      (especially numeric ones) if an earlier one gets dropped in old kernels.
      If the kernel introduces a new template type and a core_pattern uses it,
      the core dump handler might not expect that the argument can be dropped
      in old kernels.
      
      For example, this can result in security issues when %d is dropped in
      old kernels.  This happened with the corekeeper package in Debian and
      resulted in the interface between corekeeper and Linux having to be
      rewritten to use command-line options to differentiate between template
      types.
      
      The core_pattern for most core dump handlers is written by the handler
      author who would generally not insert unknown template types so this
      change should be compatible with all the core dump handlers that exist.
      
      Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
      Fixes: 74aadce9 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
      Signed-off-by: NPaul Wise <pabs3@bonedaddy.net>
      Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
      Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/]
      Suggested-by: NJakub Wilk <jwilk@jwilk.net>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      315c6926
    • Y
      ocfs2: remove set but not used variable 'last_hash' · 7bc36e3c
      YueHaibing 提交于
      Fixes gcc '-Wunused-but-set-variable' warning:
      
        fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
        fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
      
      It's never used and can be removed.
      
      Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.comSigned-off-by: NYueHaibing <yuehaibing@huawei.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7bc36e3c
  12. 02 8月, 2019 1 次提交
    • D
      block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb
      Damien Le Moal 提交于
      The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
      (patch 6a43074e) introduced two problems with BIO fragment handling
      for direct IOs:
      1) The dio size processed is calculated by incrementing the ret variable
      by the size of the bio fragment issued for the dio. However, this size
      is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
      which may result in referencing the bi_size value after the bio
      completed, resulting in an incorrect value use.
      2) The ret variable is not incremented by the size of the last bio
      fragment issued for the bio, leading to an invalid IO size being
      returned to the user.
      
      Fix both problem by using dio->size (which is incremented before the bio
      submission) to update the value of ret after bio submissions, including
      for the last bio fragment issued.
      
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NMasato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0eb6ddfb
  13. 01 8月, 2019 2 次提交
    • A
      gfs2: Inode dirtying fix · 706cb549
      Andreas Gruenbacher 提交于
      With the recent iomap write page reclaim deadlock fix, it turns out that the
      GLF_DIRTY flag isn't always set when it needs to be anymore: previously, this
      happened as a side effect of always adding the inode buffer head to the current
      transaction with gfs2_trans_add_meta, but this isn't happening consistently
      anymore.  Fix by removing an additional unnecessary gfs2_trans_add_meta call
      and by setting the GLF_DIRTY flag in gfs2_iomap_end.
      
      (The GLF_DIRTY flag causes inode_go_sync to flush the transaction log when
      syncing out the glock of that inode.  When the flag isn't set, inode_go_sync
      will skip inodes, including ones with an i_state of I_DIRTY_PAGES, which will
      lead to cluster incoherency.)
      
      In addition, in gfs2_iomap_page_done, if the metadata has changed, mark the
      inode as I_DIRTY_DATASYNC to have the inode added to the current transaction:
      we don't expect metadata to change here, but let's err on the safe side.
      
      Fixes: d0a22a4b ("gfs2: Fix iomap write page reclaim deadlock");
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      706cb549
    • A
      Unbreak mount_capable() · c2c44ec2
      Al Viro 提交于
      In "consolidate the capability checks in sget_{fc,userns}())" the
      wrong argument had been passed to mount_capable() by sget_fc().
      That mistake had been further obscured later, when switching
      mount_capable() to fs_context has moved the calculation of
      bogus argument from sget_fc() to mount_capable() itself.  It
      should've been fc->user_ns all along.
      Screwed-up-by: NAl Viro <viro@zeniv.linux.org.uk>
      Reported-by: NChristian Brauner <christian@brauner.io>
      Tested-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c2c44ec2
  14. 31 7月, 2019 2 次提交
    • J
      io_uring: fix KASAN use after free in io_sq_wq_submit_work · d0ee8791
      Jackie Liu 提交于
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC report that:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work free and read req again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d0ee8791
    • A
      compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · 055d8824
      Arnd Bergmann 提交于
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      sincen only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
      sockaddr_pppox)), which is different on 64-bit architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
      Acked-by: NGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      055d8824