1. 03 Apr, 2019 8 commits
    • NFSv4.1 don't free interrupted slot on open · 64751542
      Olga Kornievskaia committed
      commit 0cb98abb5bd13b9a636bde603d952d722688b428 upstream.
      
      Allow the async rpc task to finish and update the open state if needed,
      then free the slot. Otherwise, the async rpc is unable to decode the reply.
      
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Fixes: ae55e59d ("pnfs: Don't release the sequence slot...")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NFS: fix mount/umount race in nlmclnt. · da57cba4
      NeilBrown committed
      commit 4a9be28c45bf02fa0436808bb6c0baeba30e120e upstream.
      
      If the last NFSv3 unmount from a given host races with a mount from the
      same host, we can destroy an nlm_host that is still in use.
      
      Specifically nlmclnt_lookup_host() can increment h_count on
      an nlm_host that nlmclnt_release_host() has just successfully called
      refcount_dec_and_test() on.
      Once nlmclnt_lookup_host() drops the mutex, nlm_destroy_host_lock()
      will be called to destroy the nlmclnt which is now in use again.
      
      The cause of the problem is that the dec_and_test happens outside the
      locked region.  This is easily fixed by using
      refcount_dec_and_mutex_lock().
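
As a rough illustration of the pattern (a userspace sketch, not the lockd code), the idea is that the final decrement may only happen while the mutex is held, so a concurrent lookup under the same mutex can never revive an object that is about to be destroyed. The names `struct host`, `host_put_and_lock` and `host_get_locked` are hypothetical, and a pthread mutex stands in for the kernel mutex:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for nlm_host; only the reference count matters here. */
struct host {
    atomic_int count;
};

static pthread_mutex_t host_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drop one reference. Returns true with host_mutex held if the count
 * reached zero, and false (mutex not held) otherwise.
 */
static bool host_put_and_lock(struct host *h)
{
    int old = atomic_load(&h->count);

    assert(old > 0); /* dropping a reference that is not held is a bug */

    /* Fast path: not the last reference, decrement without the mutex. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(&h->count, &old, old - 1))
            return false;
    }

    /* Slow path: possibly the last reference, so decide under the mutex. */
    pthread_mutex_lock(&host_mutex);
    if (atomic_fetch_sub(&h->count, 1) == 1)
        return true; /* caller destroys the object, then unlocks */
    pthread_mutex_unlock(&host_mutex);
    return false;
}

/* Lookup side: takes a new reference under the same mutex. */
static void host_get_locked(struct host *h)
{
    pthread_mutex_lock(&host_mutex);
    atomic_fetch_add(&h->count, 1);
    pthread_mutex_unlock(&host_mutex);
}
```

A caller that gets `true` back tears the object down and then releases the mutex. A lookup that wins the mutex first bumps the count, so the racing put sees a non-final decrement and returns `false`.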
      
      Fixes: 8ea6ecc8 ("lockd: Create client-side nlm_host cache")
      Cc: stable@vger.kernel.org (v2.6.38+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix assertion failure on fsync with NO_HOLES enabled · fd1b2536
      Filipe Manana committed
      commit 0ccc3876e4b2a1559a4dbe3126dda4459d38a83b upstream.
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found, to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode. That
      is true most of the time, and I couldn't find or didn't remember any
      exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansions/fixes of the assertion
      were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where a falloc expands the i_size of an
      inode to exactly the sector size and an inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possible either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size · 0ae3b84b
      Nikolay Borisov committed
      commit 139a56170de67101791d6e6c8e940c6328393fe9 upstream.
      
      qgroup_rsv_size is calculated as the product of
      outstanding_extent * fs_info->nodesize. The product is calculated with
      32 bit precision since both variables are defined as u32. Yet
      qgroup_rsv_size expects a 64 bit result.
      
      Avoid possible multiplication overflow by casting outstanding_extent to
      u64. Such overflow would in the worst case (64K nodesize) require more
      than 65536 extents, which is quite large and it's not likely that it
      would happen in practice.
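
The difference between the two ways of computing the product can be shown with a small standalone sketch. The function names are hypothetical and `uint32_t`/`uint64_t` stand in for the kernel's `u32`/`u64`. The sketch assumes a platform where `int` is 32 bits wide, so the 32-bit multiplication wraps modulo 2^32:

```c
#include <assert.h>
#include <stdint.h>

/* Product evaluated in 32 bits and only then widened: the result wraps. */
static uint64_t rsv_size_32bit(uint32_t outstanding_extents, uint32_t nodesize)
{
    return outstanding_extents * nodesize;
}

/* One operand cast first, so the multiplication itself is done in 64 bits. */
static uint64_t rsv_size_64bit(uint32_t outstanding_extents, uint32_t nodesize)
{
    uint64_t result = (uint64_t)outstanding_extents * nodesize;

    /* Sanity check: dividing back recovers the operand, so nothing wrapped. */
    assert(nodesize == 0 || result / nodesize == outstanding_extents);
    return result;
}
```

With a 64K nodesize and 65537 outstanding extents, the 32-bit variant returns 65536 while the 64-bit variant returns 4295032832.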
      
      Fixes-coverity-id: 1435101
      Fixes: ff6bc37e ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: raid56: properly unmap parity page in finish_parity_scrub() · 1cf4ab01
      Andrea Righi committed
      commit 3897b6f0a859288c22fb793fad11ec2327e60fcd upstream.
      
      Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
      a reference counter bug on i386, i.e.:
      
       [ 157.662401] kernel BUG at mm/highmem.c:349!
       [ 157.666725] invalid opcode: 0000 [#1] SMP PTI
      
      The reason is that kunmap(p_page) was completely left out, so we never
      did an unmap for the p_page and the loop unmapping the rbio page was
      iterating over the wrong number of stripes: unmapping should be done
      with nr_data instead of rbio->real_stripes.
      
      Test case to reproduce the bug:
      
       - create a raid5 btrfs filesystem:
         # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
      
       - mount it:
         # mount /dev/sdb /mnt
      
       - run btrfs scrub in a loop:
         # while :; do btrfs scrub start -BR /mnt; done
      
      BugLink: https://bugs.launchpad.net/bugs/1812845
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: don't report readahead errors and don't update statistics · d952c337
      David Sterba committed
      commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac upstream.
      
      As readahead is an optimization, all errors are usually filtered out,
      but still properly handled when the real read call is done. The commit
      5e9d3982 ("btrfs: readpages() should submit IO as read-ahead") added
      REQ_RAHEAD to readpages() because that's only used for readahead
      (despite what one would expect from the callback name).
      
      This causes a flood of messages and inflated read error stats, so skip
      reporting in case it's readahead.
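
The behaviour described above can be modelled in a few lines. This is only a sketch under assumptions: `MODEL_REQ_RAHEAD`, `struct dev_stats_model` and `account_read_error` are made-up names for illustration and do not correspond to the btrfs code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for the readahead marker on a request. */
#define MODEL_REQ_RAHEAD (1u << 0)

/* Hypothetical per-device statistics; only the read error counter is modelled. */
struct dev_stats_model {
    unsigned long read_errs;
};

/*
 * Account a failed read. Readahead failures are skipped, because the real
 * read that follows will report the error if it persists.
 * Returns true if the error was counted.
 */
static bool account_read_error(struct dev_stats_model *stats, unsigned int io_flags)
{
    assert(stats != NULL);

    if (io_flags & MODEL_REQ_RAHEAD)
        return false;

    stats->read_errs++;
    return true;
}
```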
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403
      Reported-by: LimeTech <tomm@lime-technology.com>
      Fixes: 5e9d3982 ("btrfs: readpages() should submit IO as read-ahead")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: remove WARN_ON in log_dir_items · b57220cc
      Josef Bacik committed
      commit 2cc8334270e281815c3850c3adea363c51f21e0d upstream.
      
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletions in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix incorrect file size after shrinking truncate and fsync · 22dcb30f
      Filipe Manana committed
      commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream.
      
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller than
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller than the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
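
One way to read the rule in the paragraph above is "log the smaller of the two sizes". The sketch below expresses just that reading. The function name is hypothetical, and whether the actual patch is structured this way is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the i_size to write into the log when logging a new name.
 * - logged size smaller than current size: keep the logged size
 *   (the data loss case the earlier commit protects against);
 * - current size smaller than logged size: a shrinking truncate
 *   happened, so use the current size.
 */
static uint64_t isize_for_new_name(uint64_t logged_isize, uint64_t current_isize)
{
    uint64_t size = logged_isize;

    if (current_isize < logged_isize)
        size = current_isize;

    assert(size <= current_isize);
    return size;
}
```

For a file logged at 8000 bytes and then truncated to 3000 bytes, this yields 3000 rather than the stale 8000.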
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if we did not sync the log during the rename
      operation, the same problem (file size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: Seulbae Kim <seulbae@gatech.edu>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2. 27 Mar, 2019 7 commits
    • f2fs: fix to avoid deadlock of atomic file operations · 1fd916e8
      Chao Yu committed
      commit 48432984d718c95cf13e26d487c2d1b697c3c01f upstream.
      
      Thread A				Thread B
      - __fput
       - f2fs_release_file
        - drop_inmem_pages
         - mutex_lock(&fi->inmem_lock)
         - __revoke_inmem_pages
          - lock_page(page)
      					- open
      					- f2fs_setattr
      					- truncate_setsize
      					 - truncate_inode_pages_range
      					  - lock_page(page)
      					  - truncate_cleanup_page
      					   - f2fs_invalidate_page
      					    - drop_inmem_page
      					    - mutex_lock(&fi->inmem_lock);
      
      We may encounter above ABBA deadlock as reported by Kyungtae Kim:
      
      I'm reporting a bug in linux-4.17.19: "INFO: task hung in
      drop_inmem_page" (no reproducer)
      
      I think this might be somehow related to the following:
      https://groups.google.com/forum/#!searchin/syzkaller-bugs/INFO$3A$20task$20hung$20in$20%7Csort:date/syzkaller-bugs/c6soBTrdaIo/AjAzPeIzCgAJ
      
      =========================================
      INFO: task syz-executor7:10822 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor7   D27024 10822   6346 0x00000004
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       schedule_preempt_disabled+0x18/0x30 kernel/sched/core.c:3617
       __mutex_lock_common kernel/locking/mutex.c:833 [inline]
       __mutex_lock+0x5bd/0x1410 kernel/locking/mutex.c:893
       mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:908
       drop_inmem_page+0xcb/0x810 fs/f2fs/segment.c:327
       f2fs_invalidate_page+0x337/0x5e0 fs/f2fs/data.c:2401
       do_invalidatepage mm/truncate.c:165 [inline]
       truncate_cleanup_page+0x261/0x330 mm/truncate.c:187
       truncate_inode_pages_range+0x552/0x1610 mm/truncate.c:367
       truncate_inode_pages mm/truncate.c:478 [inline]
       truncate_pagecache+0x6d/0x90 mm/truncate.c:801
       truncate_setsize+0x81/0xa0 mm/truncate.c:826
       f2fs_setattr+0x44f/0x1270 fs/f2fs/file.c:781
       notify_change+0xa62/0xe80 fs/attr.c:313
       do_truncate+0x12e/0x1e0 fs/open.c:63
       do_last fs/namei.c:2955 [inline]
       path_openat+0x2042/0x29f0 fs/namei.c:3505
       do_filp_open+0x1bd/0x2c0 fs/namei.c:3540
       do_sys_open+0x35e/0x4e0 fs/open.c:1101
       __do_sys_open fs/open.c:1119 [inline]
       __se_sys_open fs/open.c:1114 [inline]
       __x64_sys_open+0x89/0xc0 fs/open.c:1114
       do_syscall_64+0xc4/0x4e0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f734e459c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
      RAX: ffffffffffffffda RBX: 00007f734e45a6cc RCX: 00000000004497b9
      RDX: 0000000000000104 RSI: 00000000000a8280 RDI: 0000000020000080
      RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 0000000000007230 R14: 00000000006f02d0 R15: 00007f734e45a700
      INFO: task syz-executor7:10858 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor7   D28880 10858   6346 0x00000004
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:565 [inline]
       rwsem_down_write_failed+0x5e6/0xc90 kernel/locking/rwsem-xadd.c:594
       call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
       __down_write arch/x86/include/asm/rwsem.h:142 [inline]
       down_write+0x58/0xa0 kernel/locking/rwsem.c:72
       inode_lock include/linux/fs.h:713 [inline]
       do_truncate+0x120/0x1e0 fs/open.c:61
       do_last fs/namei.c:2955 [inline]
       path_openat+0x2042/0x29f0 fs/namei.c:3505
       do_filp_open+0x1bd/0x2c0 fs/namei.c:3540
       do_sys_open+0x35e/0x4e0 fs/open.c:1101
       __do_sys_open fs/open.c:1119 [inline]
       __se_sys_open fs/open.c:1114 [inline]
       __x64_sys_open+0x89/0xc0 fs/open.c:1114
       do_syscall_64+0xc4/0x4e0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f734e3b4c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
      RAX: ffffffffffffffda RBX: 00007f734e3b56cc RCX: 00000000004497b9
      RDX: 0000000000000104 RSI: 00000000000a8280 RDI: 0000000020000080
      RBP: 000000000071c238 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 0000000000007230 R14: 00000000006f02d0 R15: 00007f734e3b5700
      INFO: task syz-executor5:10829 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor5   D28760 10829   6308 0x80000002
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       io_schedule+0x21/0x80 kernel/sched/core.c:5179
       wait_on_page_bit_common mm/filemap.c:1100 [inline]
       __lock_page+0x2b5/0x390 mm/filemap.c:1273
       lock_page include/linux/pagemap.h:483 [inline]
       __revoke_inmem_pages+0xb35/0x11c0 fs/f2fs/segment.c:231
       drop_inmem_pages+0xa3/0x3e0 fs/f2fs/segment.c:306
       f2fs_release_file+0x2c7/0x330 fs/f2fs/file.c:1556
       __fput+0x2c7/0x780 fs/file_table.c:209
       ____fput+0x1a/0x20 fs/file_table.c:243
       task_work_run+0x151/0x1d0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x8ba/0x30a0 kernel/exit.c:865
       do_group_exit+0x13b/0x3a0 kernel/exit.c:968
       get_signal+0x6bb/0x1650 kernel/signal.c:2482
       do_signal+0x84/0x1b70 arch/x86/kernel/signal.c:810
       exit_to_usermode_loop+0x155/0x190 arch/x86/entry/common.c:162
       prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
       do_syscall_64+0x445/0x4e0 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f1c68e74ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
      RAX: fffffffffffffe00 RBX: 000000000071bf80 RCX: 00000000004497b9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000071bf80
      RBP: 000000000071bf80 R08: 0000000000000000 R09: 000000000071bf58
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 00007f1c68e759c0 R15: 00007f1c68e75700
      
      This patch uses trylock_page to mitigate this deadlock condition.
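
The general idea behind using a trylock to break an ABBA ordering can be sketched in userspace as follows. This is a model under assumptions, not the f2fs change itself: the page lock is represented by a single atomic bit, a pthread mutex plays the role of `inmem_lock`, and all names ending in `_model` plus `revoke_pass` are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page: the lock is one bit, plus a "done" marker. */
struct page_model {
    atomic_flag locked;
    bool revoked;
};

static pthread_mutex_t inmem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if the page lock was acquired, never blocks. */
static bool trylock_page_model(struct page_model *p)
{
    return !atomic_flag_test_and_set(&p->locked);
}

static void unlock_page_model(struct page_model *p)
{
    atomic_flag_clear(&p->locked);
}

/*
 * One pass over the pages while holding inmem_lock. A page whose lock is
 * contended is skipped instead of waited on, so this thread never sleeps
 * on a page lock while it holds the mutex. Returns the number of pages
 * skipped, which the caller can retry on a later pass.
 */
static size_t revoke_pass(struct page_model *pages, size_t n)
{
    size_t skipped = 0;

    assert(pages != NULL || n == 0);

    pthread_mutex_lock(&inmem_lock);
    for (size_t i = 0; i < n; i++) {
        if (pages[i].revoked)
            continue;
        if (!trylock_page_model(&pages[i])) {
            skipped++;
            continue;
        }
        pages[i].revoked = true;
        unlock_page_model(&pages[i]);
    }
    pthread_mutex_unlock(&inmem_lock);
    return skipped;
}
```

Because the pass releases the mutex before any retry, the other thread (which holds the page lock and wants the mutex) can make progress and eventually drop the page lock.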
      
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: brelse all indirect buffer in ext4_ind_remove_space() · d12d8641
      zhangyi (F) committed
      commit 674a2b27234d1b7afcb0a9162e81b2e53aeef217 upstream.
      
      All indirect buffers obtained by ext4_find_shared() should be released no
      matter whether the branch should be freed or not. But currently we forget
      to release the lower depth indirect buffers when removing space from the
      same higher depth indirect block. This leads to a buffer leak and,
      furthermore, it may lead to quota information corruption when using old
      quota. Consider the following case:
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, then punch holes
         in some of them; this may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again; it will create two new
         aquota files and write the checked quota information to them, which
         may reuse the freed indirect block (whose buffer and page
         cache were not freed) as a data block.
       - Enable quota again; it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because the buffer of the quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device, which leads to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fixes this problem by releasing the missing indirect buffers
      in ext4_ind_remove_space().
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix data corruption caused by unaligned direct AIO · 76c9ee6b
      Lukas Czerner committed
      commit 372a03e01853f860560eade508794dd274e9b390 upstream.
      
      Ext4 needs to serialize unaligned direct AIO because the zeroing of
      partial blocks of two competing unaligned AIOs can result in data
      corruption.
      
      However it decides not to serialize if the potentially unaligned aio is
      past i_size with the rationale that no pending writes are possible past
      i_size. Unfortunately if the i_size is not block aligned and the second
      unaligned write lands past i_size, but still into the same block, it has
      the potential of corrupting the previous unaligned write to the same
      block.
      
      This is a (very simplified) reproducer from Frank:
      
          // 41472 = (10 * 4096) + 512
          // 37376 = 41472 - 4096
      
          ftruncate(fd, 41472);
          io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
          io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);
      
          io_submit(io_ctx, 1, &iocbs[0]);
          io_submit(io_ctx, 1, &iocbs[1]);
      
          io_getevents(io_ctx, 2, 2, events, NULL);
      
      Without this patch the 512B range from 40960 up to the start of the
      second unaligned write (41472) is going to be zeroed overwriting the data
      written by the first write. This is a data corruption.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      
      With this patch the data corruption is avoided because we will recognize
      the unaligned_aio and wait for the unwritten extent conversion.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      *
      0000b200
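
The offsets used in the reproducer can be checked with a little arithmetic. The sketch below assumes a 4096-byte block size (as the reproducer's comments imply) and uses hypothetical helper names. It shows that both writes are unaligned and that the second one, although it starts at i_size, still begins inside the block that contains i_size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FS_BLOCK_SIZE 4096u /* assumed block size, matching the reproducer */

/* A write is unaligned if its start or its end is not on a block boundary. */
static bool is_unaligned(uint64_t offset, uint64_t len)
{
    assert(len > 0);
    return (offset % FS_BLOCK_SIZE) != 0 ||
           ((offset + len) % FS_BLOCK_SIZE) != 0;
}

/*
 * True if a write that starts at or past i_size still begins inside the
 * partially used block that contains i_size.
 */
static bool shares_eof_block(uint64_t offset, uint64_t i_size)
{
    return (i_size % FS_BLOCK_SIZE) != 0 &&
           offset >= i_size &&
           offset / FS_BLOCK_SIZE == i_size / FS_BLOCK_SIZE;
}
```

For the reproducer, `is_unaligned(37376, 4096)` and `is_unaligned(41472, 4096)` are both true, and `shares_eof_block(41472, 41472)` is true because both offsets fall in block 10. A check that only compares the offset against i_size therefore misses the overlap.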
      
      Reported-by: Frank Sorenson <fsorenso@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Fixes: e9e3bcec ("ext4: serialize unaligned asynchronous DIO")
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix NULL pointer dereference while journal is aborted · 558331d0
      Jiufei Xue committed
      commit fa30dde38aa8628c73a6dded7cb0bba38c27b576 upstream.
      
      We see the following NULL pointer dereference while running xfstests
      generic/475:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10
      RIP: 0010:ext4_do_update_inode+0x4ec/0x760
      ...
      Call Trace:
      ? jbd2_journal_get_write_access+0x42/0x50
      ? __ext4_journal_get_write_access+0x2c/0x70
      ? ext4_truncate+0x186/0x3f0
      ext4_mark_iloc_dirty+0x61/0x80
      ext4_mark_inode_dirty+0x62/0x1b0
      ext4_truncate+0x186/0x3f0
      ? unmap_mapping_pages+0x56/0x100
      ext4_setattr+0x817/0x8b0
      notify_change+0x1df/0x430
      do_truncate+0x5e/0x90
      ? generic_permission+0x12b/0x1a0
      
      This is triggered because the NULL pointer handle->h_transaction was
      dereferenced in function ext4_update_inode_fsync_trans().
      I found that h_transaction was set to NULL in jbd2__journal_restart(),
      but the handle failed to attach to a new transaction while the journal
      is aborted.
      
      Fix this by checking the handle before updating the inode.
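
A simplified model of "check the handle before using it" might look like the following. The structures and the function are hypothetical stand-ins with a `_model` suffix. They only illustrate guarding the dereference of `h_transaction` and are not the ext4 or jbd2 definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the journal structures. */
struct transaction_model {
    unsigned int tid;
};

struct handle_model {
    struct transaction_model *h_transaction; /* NULL after a failed restart */
    bool aborted;
};

struct inode_model {
    unsigned int sync_tid;
};

/*
 * Record the transaction id on the inode only when the handle is still
 * usable. Returns true if the inode was updated.
 */
static bool update_fsync_trans_model(const struct handle_model *h,
                                     struct inode_model *inode)
{
    assert(inode != NULL);

    if (h == NULL || h->aborted || h->h_transaction == NULL)
        return false;

    inode->sync_tid = h->h_transaction->tid;
    return true;
}
```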
      
      Fixes: b436b9be ("ext4: Wait for proper transaction commit on fsync")
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • SMB3: Fix SMB3.1.1 guest mounts to Samba · 38bd575b
      Steve French committed
      commit 8c11a607d1d9cd6e7f01fd6b03923597fb0ef95a upstream.
      
      Work around a problem with Samba responses to SMB3.1.1
      null user (guest) mounts.  The server doesn't set the
      expected flag in the session setup response so we have
      to do a similar check to what is done in smb3_validate_negotiate
      where we also check if the user is a null user (but not sec=krb5
      since username might not be passed in on mount for Kerberos case).
      
      Note that the commit below tightened the conditions and forced signing
      for the SMB2-TreeConnect commands as per MS-SMB2.
      However, this should only apply to normal user sessions and not for
      cases where there is no user (even if server forgets to set the flag
      in the response) since we don't have anything useful to sign with.
      This is especially important now that the more secure SMB3.1.1 protocol
      is in the default dialect list.
      
      An earlier patch ("cifs: allow guest mounts to work for smb3.11") fixed
      the guest mounts to Windows.
      
      Fixes: 6188f28b ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Paulo Alcantara <palcantara@suse.de>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cifs: allow guest mounts to work for smb3.11 · 14c52aca
      Ronnie Sahlberg committed
      commit e71ab2aa06f731a944993120b0eef1556c63b81c upstream.
      
      Fix Guest/Anonymous sessions so that they work with SMB 3.11.
      
      The commit noted below tightened the conditions and forced signing for
      the SMB2-TreeConnect commands as per MS-SMB2.
      However, this should only apply to normal user sessions and not for
      Guest/Anonymous sessions.
      
      Fixes: 6188f28b ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • udf: Fix crash on IO error during truncate · c72e90d9
      Jan Kara committed
      commit d3ca4651d05c0ff7259d087d8c949bcf3e14fb46 upstream.
      
      When truncate(2) hits an IO error while reading an indirect extent block,
      the code just BUGs with:
      
      kernel BUG at linux-4.15.0/fs/udf/truncate.c:249!
      ...
      
      Fix the problem by bailing out cleanly in case of IO error.
      
      CC: stable@vger.kernel.org
      Reported-by: Njean-luc malet <jeanluc.malet@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c72e90d9
  3. 24 Mar, 2019 25 commits
    • T
      NFSv4.1: Reinitialise sequence results before retransmitting a request · 4af185fe
      Trond Myklebust committed
      commit c1dffe0bf7f9c3d57d9f237a7cb2a81e62babd2b upstream.

      If we have to retransmit a request, we should ensure that we reinitialise
      the sequence results structure, since in the event of a signal
      we need to treat the request as if it had not been sent.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4af185fe
    • Y
      nfsd: fix wrong check in write_v4_end_grace() · ecab6ab1
      Yihao Wu committed
      commit dd838821f0a29781b185cd8fb8e48d5c177bd838 upstream.

      Commit 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before
      nfsd startup" is trying to fix a NULL dereference issue, but it
      mistakenly checks if the nfsd server is started. So fix it.

      Fixes: 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before nfsd startup"
      Cc: stable@vger.kernel.org
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecab6ab1
    • N
      nfsd: fix memory corruption caused by readdir · 8056912c
      NeilBrown committed
      commit b602345da6cbb135ba68cf042df8ec9a73da7981 upstream.
      
      If the reply to an NFSv3 readdir{,plus} request is laid out so that the
      "offset" of one entry has to be split across 2 pages, and is sized
      so that the next directory entry doesn't fit in the requested size,
      then memory corruption can happen.
      
      When encode_entry() is called after encoding the last entry that fits,
      it notices that ->offset and ->offset1 are set, and so stores the
      offset value in the two pages as required.  It clears ->offset1 but
      *does not* clear ->offset.
      
      Normally this omission doesn't matter as encode_entry_baggage() will
      be called, and will set ->offset to a suitable value (not on a page
      boundary).
      But in the case where cd->buflen < elen and nfserr_toosmall is
      returned, ->offset is not reset.
      
      This means that nfsd3_proc_readdirplus will see ->offset with a value 4
      bytes before the end of a page, and ->offset1 set to NULL.
      It will try to write 8 bytes to ->offset.
      If we are lucky, the next page will be read-only, and the system will
        BUG: unable to handle kernel paging request at...
      
      If we are unlucky, some innocent page will have the first 4 bytes
      corrupted.
      
      nfsd3_proc_readdir() doesn't even check for ->offset1; it just blindly
      writes 8 bytes to the offset wherever it is.
      
      Fix this by clearing ->offset after it is used, and copying the
      ->offset handling code from nfsd3_proc_readdirplus into
      nfsd3_proc_readdir.
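
      The split write can be illustrated in userspace. This is only a sketch of
      the idea, not the nfsd code: struct cookie_slot and store_cookie() are
      hypothetical names, with "first" and "second" standing in for ->offset and
      ->offset1. The part that mirrors the fix is clearing both pointers once
      the value has been stored, so a later caller cannot write through a stale
      pointer near the end of a page.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the readdir state. When the 8-byte slot
 * straddles two pages, first points at the last 4 bytes of one page and
 * second at the first 4 bytes of the next page. */
struct cookie_slot {
    unsigned char *first;   /* plays the role of ->offset  */
    unsigned char *second;  /* plays the role of ->offset1 */
};

/* Store a 64-bit value in big-endian (XDR) order, then forget the slot.
 * Returns the number of bytes written (0 if no slot was pending). */
size_t store_cookie(struct cookie_slot *slot, uint64_t value)
{
    unsigned char be[8];
    size_t written = 0;
    int i;

    for (i = 0; i < 8; i++)
        be[i] = (unsigned char)(value >> (56 - 8 * i));

    if (slot->first && slot->second) {
        memcpy(slot->first, be, 4);       /* high half, end of page 1 */
        memcpy(slot->second, be + 4, 4);  /* low half, start of page 2 */
        written = 8;
    } else if (slot->first) {
        memcpy(slot->first, be, 8);       /* contiguous case */
        written = 8;
    }

    /* Clear both pointers after use, not only the second one. */
    slot->first = NULL;
    slot->second = NULL;
    return written;
}
```

      Calling store_cookie() a second time on the same slot writes nothing,
      which is the property the unpatched code lacked for ->offset.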
      
      (Note that the commit hash in the Fixes tag is from the 'history'
       tree - this bug predates git).
      
      Fixes: 0b1d57cf7654 ("[PATCH] kNFSd: Fix nfs3 dentry encoding")
      Fixes-URL: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0b1d57cf7654
      Cc: stable@vger.kernel.org (v2.6.12+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8056912c
    • J
      nfsd: fix performance-limiting session calculation · 10a68cdf
      J. Bruce Fields committed
      commit c54f24e338ed2a35218f117a4a1afb5f9e2b4e64 upstream.
      
      We're unintentionally limiting the number of slots per nfsv4.1 session
      to 10.  Often more than 10 simultaneous RPCs are needed for the best
      performance.
      
      This calculation was meant to prevent any one client from using up more
      than a third of the limit we set for total memory use across all clients
      and sessions.  Instead, it's limiting the client to a third of the
      maximum for a single session.
      
      Fix this.
      Reported-by: Chris Tracy <ctracy@engr.scu.edu>
      Cc: stable@vger.kernel.org
      Fixes: de766e57 "nfsd: give out fewer session slots as limit approaches"
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      10a68cdf
    • T
      NFS: Don't recoalesce on error in nfs_pageio_complete_mirror() · 2c648caf
      Trond Myklebust committed
      commit 8127d82705998568b52ac724e28e00941538083d upstream.

      If the I/O completion failed with a fatal error, then we should just
      exit nfs_pageio_complete_mirror() rather than try to recoalesce.

      Fixes: a7d42ddb ("nfs: add mirroring support to pgio layer")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c648caf
    • T
      NFS: Fix an I/O request leakage in nfs_do_recoalesce · 63b0ee12
      Trond Myklebust committed
      commit 4d91969ed4dbcefd0e78f77494f0cb8fada9048a upstream.

      Whether we need to exit early, or just reprocess the list, we
      must not lose track of the request which failed to get recoalesced.

      Fixes: 03d5eb65 ("NFS: Fix a memory leak in nfs_do_recoalesce")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      63b0ee12
    • T
      NFS: Fix I/O request leakages · be74fddc
      Trond Myklebust committed
      commit f57dcf4c72113c745d83f1c65f7291299f65c14f upstream.

      When we fail to add the request to the I/O queue, we currently leave it
      to the caller to free the failed request. However, since some of the
      requests that fail are actually created by nfs_pageio_add_request()
      itself, and are not passed back to the caller, this leads to a leakage
      issue, which can again cause page locks to leak.

      This commit addresses the leakage by freeing the created requests on
      error, using desc->pg_completion_ops->error_cleanup().
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Fixes: a7d42ddb ("nfs: add mirroring support to pgio layer")
      Cc: stable@vger.kernel.org # v4.0: c18b96a1: nfs: clean up rest of reqs
      Cc: stable@vger.kernel.org # v4.0: d600ad1f: NFS41: pop some layoutget
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      be74fddc
    • Z
      jbd2: fix compile warning when using JBUFFER_TRACE · 584f390d
      zhangyi (F) committed
      commit 01215d3edb0f384ddeaa5e4a22c1ae5ff634149f upstream.
      
      The jh pointer may be used uninitialized in the two cases below, and the
      compiler complains about it when the JBUFFER_TRACE macro is enabled. Fix them.
      
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      584f390d
    • Z
      jbd2: clear dirty flag when revoking a buffer from an older transaction · dbe4bc99
      zhangyi (F) committed
      commit 904cdbd41d749a476863a0ca41f6f396774f26e4 upstream.
      
      We have captured a data corruption problem on ext4 while truncating
      an extent index block. Imagine that we are revoking a buffer which
      has been journaled by the committing transaction: the buffer's jbddirty
      flag will not be cleared in jbd2_journal_forget(), so the commit code
      will set the buffer dirty flag again after refiling the buffer.
      
      fsx                               kjournald2
                                        jbd2_journal_commit_transaction
      jbd2_journal_revoke                commit phase 1~5...
       jbd2_journal_forget
         belongs to older transaction    commit phase 6
         jbddirty not clear               __jbd2_journal_refile_buffer
                                           __jbd2_journal_unfile_buffer
                                            test_clear_buffer_jbddirty
                                             mark_buffer_dirty
      
      Finally, if the freed extent index block is allocated again as a data
      block by some other file, it may corrupt the file data when cached
      pages are written later, such as at unmount time. (In general,
      clean_bdev_aliases() related helpers should be invoked after
      re-allocation to prevent the above corruption, but unfortunately we
      missed it when zeroing out the head of extra extent blocks in
      ext4_ext_handle_unwritten_extents()).

      This patch marks the buffer as freed and sets j_next_transaction to the
      new transaction when it already belongs to the committing transaction in
      jbd2_journal_forget(), so that the commit code knows it should clear
      dirty bits when it is done with the buffer.
      
      This problem can be reproduced by xfstests generic/455 easily with
      seeds (3246 3247 3248 3249).
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbe4bc99
    • J
      ext2: Fix underflow in ext2_max_size() · 62600af3
      Jan Kara committed
      commit 1c2d14212b15a60300a2d4f6364753e87394c521 upstream.
      
      When an ext2 filesystem is created with a 64k block size, ext2_max_size()
      will return a value less than 0. Also, we cannot write any file in this fs
      since sb->maxbytes is less than 0. The core of the problem is that
      the size of the block index tree for such a large block size is more than
      i_blocks can carry. So fix the computation to account for this
      possibility.
      
      File size limits computed with the new function for the full range of
      possible block sizes look like:
      
      bits file_size
      10     17247252480
      11    275415851008
      12   2196873666560
      13   2197948973056
      14   2198486220800
      15   2198754754560
      16   2198888906752
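
      The table can be reproduced with a small userspace sketch. It is written
      from the description above and is not a copy of the patch; the function
      name ext2_max_size_sketch() is hypothetical. It assumes 12 direct block
      pointers per inode, 4-byte block pointers, and an i_blocks field that
      counts 512-byte sectors in 32 bits. Any clamp to the VFS maximum file
      size is left out.

```c
#include <stdint.h>

#define NDIR_BLOCKS 12  /* direct block pointers in an ext2 inode */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/* bits is log2 of the block size (10 for 1k blocks, 16 for 64k blocks). */
uint64_t ext2_max_size_sketch(int bits)
{
    uint64_t ppb = 1ULL << (bits - 2);  /* block pointers per block */
    /* Largest block count that a 32-bit sector counter can describe. */
    uint64_t limit = ((1ULL << 32) - 1) >> (bits - 9);
    uint64_t res, meta;

    /* Blocks addressable by the direct/indirect/double/triple tree. */
    res = NDIR_BLOCKS + ppb + ppb * ppb + ppb * ppb * ppb;
    if (res < limit)
        return res << bits;             /* the tree is the limit */

    /* Otherwise i_blocks is the limit: subtract the index blocks
     * that are needed to address that many blocks. */
    res = limit;
    limit -= NDIR_BLOCKS;
    meta = 1;                           /* the single indirect block */
    limit -= ppb;
    if (limit < ppb * ppb) {
        meta += 1 + div_round_up(limit, ppb);
        return (res - meta) << bits;
    }
    meta += 1 + ppb;                    /* full double indirect tree */
    limit -= ppb * ppb;
    meta += 1 + div_round_up(limit, ppb) + div_round_up(limit, ppb * ppb);
    return (res - meta) << bits;
}
```

      Because every intermediate value is unsigned and is bounded by the
      i_blocks limit before the final shift, the result cannot go negative
      for 64k blocks.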
      
      CC: stable@vger.kernel.org
      Reported-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62600af3
    • J
      ext4: fix crash during online resizing · 8a4fdc64
      Jan Kara committed
      commit f96c3ac8dfc24b4e38fc4c2eba5fea2107b929d1 upstream.
      
      When computing maximum size of filesystem possible with given number of
      group descriptor blocks, we forget to include s_first_data_block into
      the number of blocks. Thus for filesystems with non-zero
      s_first_data_block it can happen that computed maximum filesystem size
      is actually lower than current filesystem size which confuses the code
      and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
      flex_gd->count == 0. The problem can be reproduced like:
      
      truncate -s 100g /tmp/image
      mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
      mount -t ext4 -o loop /tmp/image /mnt
      resize2fs /dev/loop0 262145
      resize2fs /dev/loop0 300000
      
      Fix the problem by properly including s_first_data_block into the
      computed number of filesystem blocks.
      
      Fixes: 1c6bd717 "ext4: convert file system to meta_bg if needed..."
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a4fdc64
    • Y
      ext4: add mask of ext4 flags to swap · a0d876c7
      yangerkun committed
      commit abdc644e8cbac2e9b19763680e5a7cf9bab2bee7 upstream.
      
      While swapping two inodes, we swap the flags too.
      Some flags such as EXT4_JOURNAL_DATA_FL can really confuse things
      since we're not resetting the address operations structure.  The
      simplest way to keep things sane is to restrict the flags that can be
      swapped.
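
      A minimal sketch of swapping only a masked subset of flag bits follows.
      The flag values and the name swap_masked_flags() are hypothetical and
      chosen for illustration; they are not the ext4 definitions. Bits outside
      the mask stay with their original owner.

```c
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define FL_SWAPPABLE_A  0x1u
#define FL_SWAPPABLE_B  0x2u
#define FL_JOURNAL_DATA 0x4u  /* tied to per-inode state, must not move */
#define FL_SWAP_MASK    (FL_SWAPPABLE_A | FL_SWAPPABLE_B)

/* Exchange only the bits selected by mask between *a and *b. */
void swap_masked_flags(uint32_t *a, uint32_t *b, uint32_t mask)
{
    uint32_t diff = (*a ^ *b) & mask;  /* masked bits that differ */

    *a ^= diff;
    *b ^= diff;
}
```

      With a = FL_SWAPPABLE_A | FL_JOURNAL_DATA and b = FL_SWAPPABLE_B, the
      call leaves FL_JOURNAL_DATA on a while the two swappable bits trade
      places.
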
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0d876c7
    • Y
      ext4: update quota information while swapping boot loader inode · 048bfb5b
      yangerkun committed
      commit aa507b5faf38784defe49f5e64605ac3c4425e26 upstream.

      While swapping two inodes, i_data is swapped without updating the
      quota information. Also, swap_inode_boot_loader can sometimes do a
      "revert", so update the quota once all operations have finished.
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      048bfb5b
    • Y
      ext4: cleanup pagecache before swap i_data · 071f6816
      yangerkun committed
      commit a46c68a318b08f819047843abf349aeee5d10ac2 upstream.

      While swapping, we should make sure there are no new dirty pages, since
      we swap i_data between the two inodes:
      1. Take i_mmap_sem for write to avoid new pagecache from mmap
      read/write;
      2. Change filemap_flush to filemap_write_and_wait and move it into the
      region protected by the inode lock to avoid new pagecache from buffered
      read/write.
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      071f6816
    • Y
      ext4: fix check of inode in swap_inode_boot_loader · cdf9941b
      yangerkun committed
      commit 67a11611e1a5211f6569044fbf8150875764d1d0 upstream.

      Before actually swapping the inode and the boot loader inode, some
      conditions need to be checked to avoid invalid or unpermitted operations,
      such as whether the inode has inline data. These checks should be done
      under the inode lock so that the conditions cannot change while swapping.
      Some of the other conditions cannot change during the swap, but there is
      no harm in checking them under the inode lock as well.
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdf9941b
    • F
      Btrfs: fix corruption reading shared and compressed extents after hole punching · 898488e2
      Filipe Manana committed
      commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream.
      
      In the past we had data corruption when reading compressed extents that
      are shared within the same file and are consecutive. This got fixed
      by commit 005efedf ("Btrfs: fix read corruption of compressed and
      shared extents") and by commit 808f80b4 ("Btrfs: update fix for read
      corruption of compressed and shared extents"). However there was a case
      that was missing in those fixes, which is when the shared and compressed
      extents are referenced with a non-zero offset. The following shell script
      creates a reproducer for this issue:
      
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdc &> /dev/null
        mount -o compress /dev/sdc /mnt/sdc
      
        # Create a file with 3 consecutive compressed extents, each has an
        # uncompressed size of 128Kb and a compressed size of 4Kb.
        for ((i = 1; i <= 3; i++)); do
            head -c 4096 /dev/zero
            for ((j = 1; j <= 31; j++)); do
                head -c 4096 /dev/zero | tr '\0' "\377"
            done
        done > /mnt/sdc/foobar
        sync
      
        echo "Digest after file creation:   $(md5sum /mnt/sdc/foobar)"
      
        # Clone the first extent into offsets 128K and 256K.
        xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
        xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
        sync
      
        echo "Digest after cloning:         $(md5sum /mnt/sdc/foobar)"
      
        # Punch holes into the regions that are already full of zeroes.
        xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
        sync
      
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        echo "Dropping page cache..."
        sysctl -q vm.drop_caches=1
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        umount /dev/sdc
      
      When running the script we get the following output:
      
        Digest after file creation:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        linked 131072/131072 bytes at offset 131072
        128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
        linked 131072/131072 bytes at offset 262144
        128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
        Digest after cloning:         5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Digest after hole punching:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Dropping page cache...
        Digest after hole punching:   fba694ae8664ed0c2e9ff8937e7f1484  /mnt/sdc/foobar
      
      This happens because after reading all the pages of the extent in the
      range from 128K to 256K for example, we read the hole at offset 256K
      and then when reading the page at offset 260K we don't submit the
      existing bio, which is responsible for filling all the page in the
      range 128K to 256K only, therefore adding the pages from range 260K
      to 384K to the existing bio and submitting it after iterating over the
      entire range. Once the bio completes, the uncompressed data fills only
      the pages in the range 128K to 256K because there's no more data read
      from disk, leaving the pages in the range 260K to 384K unfilled. It is
      just a slightly different variant of what was solved by commit
      005efedf ("Btrfs: fix read corruption of compressed and shared
      extents").
      
      Fix this by forcing a bio submit, during readpages(), whenever we find a
      compressed extent map for a page that is different from the extent map
      for the previous page or has a different starting offset (in case it's
      the same compressed extent), instead of the extent map's original start
      offset.
      
      A test case for fstests follows soon.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 808f80b4 ("Btrfs: update fix for read corruption of compressed and shared extents")
      Fixes: 005efedf ("Btrfs: fix read corruption of compressed and shared extents")
      Cc: stable@vger.kernel.org # 4.3+
      Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      898488e2
    • J
      btrfs: ensure that a DUP or RAID1 block group has exactly two stripes · 1a00f7fd
      Johannes Thumshirn committed
      commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream.
      
      We recently had a customer issue with a corrupted filesystem. When
      trying to mount this image btrfs panicked with a division by zero in
      calc_stripe_length().
      
      The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
      takes this value and divides it by the number of copies the RAID profile
      is expected to have to calculate the number of data stripes. As a DUP
      profile is expected to have 2 copies, this division resulted in 1/2 = 0.
      Later, the 'data_stripes' variable is used as a divisor in the
      stripe length calculation, which results in a division by 0 and thus a
      kernel panic.
      
      When encountering a filesystem with a DUP block group and a
      'num_stripes' value other than 2, refuse mounting as the image is
      corrupted and will lead to unexpected behaviour.
      
      Code inspection showed a RAID1 block group has the same issues.
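
      The arithmetic behind the panic, and the kind of check that prevents it,
      can be sketched in plain C. The enum, the function names and the return
      convention below are hypothetical and are not the btrfs implementation;
      the sketch only shows that validating num_stripes first keeps the divisor
      from becoming 0.

```c
#include <stdbool.h>

/* Hypothetical profile identifiers, for illustration only. */
enum profile { PROFILE_SINGLE, PROFILE_DUP, PROFILE_RAID1 };

/* Number of copies each profile is expected to keep. */
int profile_copies(enum profile p)
{
    return p == PROFILE_SINGLE ? 1 : 2;
}

/* Reject chunks whose stripe count does not match the profile. */
bool chunk_stripes_valid(enum profile p, int num_stripes)
{
    if (num_stripes < 1)
        return false;
    if ((p == PROFILE_DUP || p == PROFILE_RAID1) && num_stripes != 2)
        return false;
    return true;
}

/* Returns 0 and fills *out on success, -1 if the chunk is rejected. */
int stripe_length(enum profile p, int num_stripes,
                  unsigned long long chunk_len, unsigned long long *out)
{
    int data_stripes;

    if (!chunk_stripes_valid(p, num_stripes))
        return -1;

    /* Without the check above, DUP with 1 stripe gives 1 / 2 == 0 here
     * and the division below would fault. */
    data_stripes = num_stripes / profile_copies(p);
    *out = chunk_len / data_stripes;
    return 0;
}
```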
      
      Fixes: e06cd3dd ("Btrfs: add validadtion checks for chunk loading")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a00f7fd
    • F
      Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl · 6e24f5a1
      Filipe Manana committed
      commit a0873490660246db587849a9e172f2b7b21fa88a upstream.

      We are holding a transaction handle when setting an acl, so we cannot
      allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
      if reclaim is triggered by the allocation. Therefore set up a nofs context.

      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e24f5a1
    • F
      Btrfs: setup a nofs context for memory allocation at btrfs_create_tree() · 61f92096
      Filipe Manana committed
      commit b89f6d1fcb30a8cbdc18ce00c7d93792076af453 upstream.

      We are holding a transaction handle when creating a tree, so we cannot
      allocate the root using GFP_KERNEL, as we could deadlock if reclaim is
      triggered by the allocation. Therefore set up a nofs context.

      Fixes: 74e4d827 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61f92096
    • V
      ovl: Do not lose security.capability xattr over metadata file copy-up · 205f149f
      Vivek Goyal committed
      commit 993a0b2aec52754f0897b1dab4c453be8217cae5 upstream.

      If a file has been copied up metadata only, and later data is copied up,
      upper loses any security.capability xattr it has (the underlying
      filesystem clears it upon file write).

      From a user's point of view, this is just a file copy-up and that should
      not result in losing security.capability xattr.  Hence, before data copy
      up, save security.capability xattr (if any) and restore it on upper after
      data copy up is complete.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: 0c288874 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper")
      Cc: <stable@vger.kernel.org> # v4.19+
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      205f149f
    • V
      ovl: During copy up, first copy up data and then xattrs · 6f048ae2
      Vivek Goyal committed
      commit 5f32879ea35523b9842bdbdc0065e13635caada2 upstream.
      
      If a file with capabilities set (and hence a security.capability xattr)
      is written, the kernel clears the security.capability xattr. For overlay,
      during file copy up, xattrs are copied up first and then data is copied
      up. This means data copy up will result in clearing the
      security.capability xattr that the file on lower has. And this can result
      in surprises. If a lower file has CAP_SETUID, then it should not be
      cleared over copy up (if nothing was actually written to the file).
      
      This also creates problems with chown logic where it first copies up file
      and then tries to clear setuid bit. But by that time security.capability
      xattr is already gone (due to data copy up), and caller gets -ENODATA.
      This has been reported by Giuseppe here.
      
      https://github.com/containers/libpod/issues/2015#issuecomment-447824842
      
      Fix this by copying up data first and then metadata. This is a regression
      which was introduced by my commit as part of the metadata only copy up
      patches.
      
      TODO: There will be some corner cases where a file is copied up metadata
      only and later data copy up happens and that will clear security.capability
      xattr. Something needs to be done about that too.
      
      Fixes: bd64e575 ("ovl: During copy up, first copy up metadata and then data")
      Cc: <stable@vger.kernel.org> # v4.19+
      Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f048ae2
    • J
      splice: don't merge into linked buffers · 2af926fd
      Jann Horn committed
      commit a0ce2f0aa6ad97c3d4927bf2ca54bcebdf062d55 upstream.
      
      Before this patch, it was possible for two pipes to affect each other after
      data had been transferred between them with tee():
      
      ============
      $ cat tee_test.c
      #define _GNU_SOURCE
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <err.h>

      int main(void) {
        int pipe_a[2];
        if (pipe(pipe_a)) err(1, "pipe");
        int pipe_b[2];
        if (pipe(pipe_b)) err(1, "pipe");
        if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
        if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
        if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");
      
        char buf[5];
        if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
        buf[4] = 0;
        printf("got back: '%s'\n", buf);
      }
      $ gcc -o tee_test tee_test.c
      $ ./tee_test
      got back: 'abxx'
      $
      ============
      
      As suggested by Al Viro, fix it by creating a separate type for
      non-mergeable pipe buffers, then changing the types of buffers in
      splice_pipe_to_pipe() and link_pipe().
      
      Cc: <stable@vger.kernel.org>
      Fixes: 7c77f0b3 ("splice: implement pipe to pipe splicing")
      Fixes: 70524490 ("[PATCH] splice: add support for sys_tee()")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2af926fd
    • V
      fs/devpts: always delete dcache dentry-s in dput() · 1c2123ff
      Varad Gautam committed
      commit 73052b0daee0b750b39af18460dfec683e4f5887 upstream.
      
      d_delete only unhashes an entry if it is reached with
      dentry->d_lockref.count != 1. Prior to commit 8ead9dd5 ("devpts:
      more pty driver interface cleanups"), d_delete was called on a dentry
      from devpts_pty_kill with two references held, which would trigger the
      unhashing, and the subsequent dputs would release it.
      
      Commit 8ead9dd5 reworked devpts_pty_kill to stop acquiring the second
      reference from d_find_alias, and the d_delete call left the dentries
      still on the hashed list without actually ever being dropped from dcache
      before explicit cleanup. This causes the number of negative dentries for
      devpts to pile up, and an `ls /dev/pts` invocation can take seconds to
      return.
      
      Provide always_delete_dentry() from simple_dentry_operations
      as .d_delete for devpts, to make the dentry be dropped from dcache.
      
      Without this cleanup, the number of dentries in /dev/pts/ can be grown
      arbitrarily as:
      
      `python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'`
      
      A systemtap probe on dcache_readdir to count d_subdirs shows this count
      to increase with each pty spawn invocation above:
      
      probe kernel.function("dcache_readdir") {
          subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs;
          p = subdirs;
          p = @cast(p, "list_head")->next;
          i = 0
          while (p != subdirs) {
            p = @cast(p, "list_head")->next;
            i = i+1;
          }
          printf("number of dentries: %d\n", i);
      }
      
      Fixes: 8ead9dd5 ("devpts: more pty driver interface cleanups")
      Signed-off-by: Varad Gautam <vrd@amazon.de>
      Reported-by: Zheng Wang <wanz@amazon.de>
      Reported-by: Brandon Schwartz <bsschwar@amazon.de>
      Root-caused-by: Maximilian Heyne <mheyne@amazon.de>
      Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de>
      CC: David Woodhouse <dwmw@amazon.co.uk>
      CC: Maximilian Heyne <mheyne@amazon.de>
      CC: Stefan Nuernberger <snu@amazon.de>
      CC: Amit Shah <aams@amazon.de>
      CC: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: Matthew Wilcox <willy@infradead.org>
      CC: Eric Biggers <ebiggers@google.com>
      CC: <stable@vger.kernel.org> # 4.9+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c2123ff
    • P
      CIFS: Fix read after write for files with read caching · 43eaa6cc
      Pavel Shilovsky committed
      commit 6dfbd84684700cb58b34e8602c01c12f3d2595c8 upstream.

      When we have a READ lease for a file and have just issued a write
      operation to the server we need to purge the cache and set oplock/lease
      level to NONE to avoid reading stale data. Currently we do that
      only if a write operation succeeded, thus not covering cases when
      a request was sent to the server but a negative error code was
      returned later for some other reason (e.g. -EIOCBQUEUED or -EINTR).
      Fix this by turning off caching regardless of the error code being
      returned.

      This patch fixes generic tests 075 and 112 from the xfs-tests.

      Cc: <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43eaa6cc
    • P
      CIFS: Do not skip SMB2 message IDs on send failures · dc8e8ad9
      Pavel Shilovsky committed
      commit c781af7e0c1fed9f1d0e0ec31b86f5b21a8dca17 upstream.
      
      When we hit failures during constructing MIDs or sending PDUs
      through the network, we end up not using message IDs assigned
      to the packet. The next SMB packet will skip those message IDs
      and continue with the next one. This behavior may lead to a server
      not granting us credits until we use the skipped IDs. Fix this by
      reverting the current ID to the original value if any errors occur
      before we push the packet through the network stack.
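
      The reserve-then-revert pattern described above might look roughly like
      the following sketch. All names here (struct conn, send_request(), the
      prepare callback and its two stubs) are hypothetical and do not come from
      the CIFS code; the callback stands for any step that can fail before the
      packet reaches the network stack.

```c
#include <stdint.h>

/* Hypothetical connection state: the next message ID to hand out. */
struct conn {
    uint64_t next_mid;
};

/* Stub preparation steps used for illustration. */
int prepare_ok(uint64_t first_mid)   { (void)first_mid; return 0; }
int prepare_fail(uint64_t first_mid) { (void)first_mid; return -1; }

/* Reserve count message IDs, run the pre-send step, and give the IDs
 * back if that step fails so the next request does not skip them. */
int send_request(struct conn *c, unsigned int count,
                 int (*prepare)(uint64_t first_mid))
{
    uint64_t first_mid = c->next_mid;
    int rc;

    c->next_mid += count;          /* reserve the IDs */
    rc = prepare(first_mid);       /* may fail before anything is sent */
    if (rc < 0)
        c->next_mid = first_mid;   /* revert: nothing used these IDs */
    return rc;
}
```

      After a failed call the counter is back at its original value, so the
      server never sees a gap in the message ID sequence.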
      
      This patch fixes the generic/310 test from the xfs-tests.
      
      Cc: <stable@vger.kernel.org> # 4.19.x
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc8e8ad9