1. 18 November 2019, 18 commits
  2. 16 November 2019, 1 commit
    • afs: Fix race in commit bulk status fetch · a28f239e
      Committed by David Howells
      When a lookup is done, the afs filesystem will perform a bulk status-fetch
      operation on the requested vnode (file) plus the next 49 other vnodes from
      the directory list (in AFS, directory contents are downloaded as blobs and
      parsed locally).  When the results are received, it will speculatively
      populate the inode cache from the extra data.
      
      However, if the lookup races with another lookup on the same directory,
      but for a different file - one that's in the 49 extra fetches - then if
      the bulk status-fetch operation finishes first, it will try to update
      the inode from the other lookup.
      
      If this other inode is still in the throes of being created, however, this
      will cause an assertion failure in afs_apply_status():
      
      	BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));
      
      on or about fs/afs/inode.c:175 because it expects data to be there already
      that it can compare to.
      
      Fix this by skipping the update if the inode is being created as the
      creator will presumably set up the inode with the same information.
      
      Fixes: 39db9815 ("afs: Fix application of the results of a inline bulk status fetch")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a28f239e
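
      A hedged sketch of the skip described above, reusing the AFS_VNODE_UNSET
      flag quoted in the assertion; the surrounding loop, the cookie layout and
      the variable names are assumptions for illustration, not the actual afs
      code:

        /* When applying inline bulk status results, skip any vnode whose
         * inode is still being created; its creator will populate it. */
        for (i = 0; i < nr_results; i++) {
                struct inode *inode = cookie->inodes[i];   /* assumed layout */
                struct afs_vnode *vnode;

                if (!inode)
                        continue;
                vnode = AFS_FS_I(inode);
                if (test_bit(AFS_VNODE_UNSET, &vnode->flags))
                        continue;       /* racing lookup is mid-creation */
                /* ... safe to apply the fetched status to this vnode ... */
        }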
  3. 15 November 2019, 2 commits
    • ceph: increment/decrement dio counter on async requests · 6a81749e
      Committed by Jeff Layton
      Ceph can in some cases issue an async DIO request, in which case we can
      end up calling ceph_end_io_direct before the I/O is actually complete.
      That may allow buffered operations to proceed while DIO requests are
      still in flight.
      
      Fix this by incrementing the i_dio_count when issuing an async DIO
      request, and decrementing it when tearing down the aio_req.
      
      Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes")
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      6a81749e
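
      The VFS already has inode_dio_begin()/inode_dio_end() helpers that adjust
      i_dio_count; a hedged sketch of the pairing described above (the aio_req
      variable and the teardown hook are assumptions):

        /* Issuing an async DIO request: account for it up front. */
        if (aio_req) {
                inode_dio_begin(inode);         /* bumps i_dio_count */
                /* ... submit the async OSD requests ... */
        }

        /* In the aio_req teardown path, drop the count again so buffered
         * callers waiting in inode_dio_wait() see the DIO as finished. */
        inode_dio_end(inode);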
    • ceph: take the inode lock before acquiring cap refs · a81bc310
      Committed by Jeff Layton
      Most of the time, we (or the VFS layer) take the inode_lock and then
      acquire caps, but ceph_read_iter does the opposite, and that can lead
      to a deadlock.
      
      When there are multiple clients treading over the same data, we can end
      up in a situation where a reader takes caps and then tries to acquire
      the inode_lock. Another task holds the inode_lock and issues a request
      to the MDS which needs to revoke the caps, but that can't happen until
      the inode_lock is unwedged.
      
      Fix this by having ceph_read_iter take the inode_lock earlier, before
      attempting to acquire caps.
      
      Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes")
      Link: https://tracker.ceph.com/issues/36348
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      a81bc310
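
      A hedged ordering sketch; the lock flavour and the cap-acquisition call
      in the real ceph_read_iter are simplified away, the point is only the
      order:

        /* Take the VFS inode lock first, then the cap references, matching
         * the order used everywhere else.  An MDS revoke triggered by a task
         * that already holds inode_lock can then always make progress. */
        inode_lock(inode);
        /* ... acquire CEPH_CAP_FILE_RD references ... */
        /* ... perform the read ... */
        /* ... drop the cap references ... */
        inode_unlock(inode);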
  4. 14 November 2019, 2 commits
  5. 12 November 2019, 2 commits
    • io_uring: make timeout sequence == 0 mean no sequence · 93bd25bb
      Committed by Jens Axboe
      Currently we make sequence == 0 behave the same as sequence == 1, but
      that's not very useful if the intent is really to have a pure timeout
      with no sequence attached.
      
      If the user passes in sqe->off == 0, then don't apply any sequence logic
      to the request; let it be driven purely by the specified timeout.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Reviewed-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      93bd25bb
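
      A userspace illustration of the new behaviour: with liburing, passing a
      count of 0 to io_uring_prep_timeout leaves sqe->off at 0, which after
      this change means a pure timeout that is not tied to any number of prior
      completions (error handling trimmed; build with -luring on a kernel that
      has IORING_OP_TIMEOUT):

        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
                struct io_uring ring;
                struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;

                io_uring_queue_init(8, &ring, 0);

                sqe = io_uring_get_sqe(&ring);
                /* count == 0 -> sqe->off == 0: a pure 1 second timeout */
                io_uring_prep_timeout(sqe, &ts, 0, 0);

                io_uring_submit(&ring);
                io_uring_wait_cqe(&ring, &cqe);
                printf("timeout fired: res=%d\n", cqe->res);   /* -ETIME expected */
                io_uring_cqe_seen(&ring, cqe);

                io_uring_queue_exit(&ring);
                return 0;
        }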
    • Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Committed by Filipe Manana
      During rename exchange we might have successfully logged the new name in
      the source root's log tree, in which case we leave our log context
      (allocated on the stack) in the root's list of log contexts. However we
      might fail to log the new name in the destination root, in which case we
      fall back to a transaction commit later and never sync the log of the
      source root, which causes the source root's log context to remain in the
      list of log contexts. This later causes invalid memory accesses because
      the context was allocated on the stack and, after rename exchange
      finishes, the stack gets reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from
      fstests, like btrfs/004, because support for rename exchange was added
      last week to fsstress in fstests.
      
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.
      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6c61710
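
      A hedged sketch of the fix described above; the root's log_mutex and the
      ctx list member follow the usual btrfs log-context layout, but the
      variable names and the surrounding condition are assumptions:

        /* We logged the new name in the source root but are falling back to
         * a transaction commit: take the stack-allocated context back off the
         * root's list before it goes out of scope. */
        if (logged_new_name_in_source && falling_back_to_commit) {
                mutex_lock(&src_root->log_mutex);
                list_del_init(&ctx_src.list);
                mutex_unlock(&src_root->log_mutex);
        }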
  6. 11 November 2019, 4 commits
    • ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either · 762c6968
      Committed by Al Viro
      We need to get the underlying dentry of the parent; sure, absent the
      races it is the parent of the underlying dentry, but there's nothing to
      prevent losing a timeslice to preemption in the middle of evaluating
      lower_dentry->d_parent->d_inode, having another process move lower_dentry
      around and having its (ex)parent no longer pinned and freed under memory
      pressure.  Then we regain the CPU and try to fetch ->d_inode from memory
      that has already been freed by that point.
      
      dentry->d_parent *is* stable here - it's an argument of ->lookup() and
      we are guaranteed that it won't be moved anywhere until we feed it
      to d_add/d_splice_alias.  So we safely go that way to get to its
      underlying dentry.
      
      Cc: stable@vger.kernel.org # since 2009 or so
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      762c6968
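
      A hedged sketch of the approach described above: derive the lower parent
      from the ecryptfs dentry's own ->d_parent, which is stable inside
      ->lookup(), instead of chasing lower_dentry->d_parent:

        struct dentry *lower_dir_dentry;
        struct inode *lower_dir_inode;

        /* dentry->d_parent is the ->lookup() argument's parent and cannot be
         * moved until d_add()/d_splice_alias(), so its lower mapping is safe
         * to dereference here. */
        lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
        lower_dir_inode = d_inode(lower_dir_dentry);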
    • ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable · e72b9dd6
      Committed by Al Viro
      lower_dentry can't go from positive to negative (we have it pinned),
      but it *can* go from negative to positive.  So fetching ->d_inode
      into a local variable, doing a blocking allocation, checking that
      now ->d_inode is non-NULL and feeding the value we'd fetched
      earlier to a function that won't accept NULL is not a good idea.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e72b9dd6
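
      A hedged sketch of the safer pattern described above; the dentry-info
      cache name and the surrounding lookup flow are assumptions:

        struct ecryptfs_dentry_info *dentry_info;
        struct inode *lower_inode;

        dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
        if (!dentry_info)
                return ERR_PTR(-ENOMEM);

        /* lower_dentry is pinned, but it may have gone from negative to
         * positive while we were (possibly) blocking above; only look at
         * ->d_inode now, and handle the still-negative case explicitly. */
        lower_inode = d_inode(lower_dentry);
        if (!lower_inode) {
                d_add(dentry, NULL);    /* hash a negative ecryptfs dentry */
                return NULL;
        }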
    • ecryptfs: fix unlink and rmdir in face of underlying fs modifications · bcf0d9d4
      Committed by Al Viro
      A problem similar to the one caught in commit 74dd7c97 ("ecryptfs_rename():
      verify that lower dentries are still OK after lock_rename()") exists for
      unlink/rmdir as well.
      
      Instead of playing with dget_parent() of the underlying dentry of the
      victim and hoping it's the same as the underlying dentry of our
      directory, do the following:
              * find the underlying dentry of the victim
              * find the underlying directory of the victim's parent (stable,
                since the victim is an ecryptfs dentry and the inode of its
                parent is held exclusive by the caller)
              * lock the inode of the dentry underlying the victim's parent
              * check that the underlying dentry of the victim is still hashed
                and has the right parent - it can be moved, but it can't be
                moved to/from the directory we are holding exclusive.  So while
                ->d_parent itself might not be stable, the result of the
                comparison is.
      
      If the check passes, everything is fine - the underlying directory is
      locked, the underlying victim is still a child of that directory and we
      can go ahead and feed them to vfs_unlink().  As in the current mainline,
      we need to pin the underlying dentry of the victim so that it won't go
      negative under us, but that's the only temporary reference that needs to
      be grabbed there.  The underlying dentry of the parent won't go away
      (it's pinned by the parent, which is held by the caller), so there's no
      need to grab it.
      
      The same problem (with the same solution) exists for rmdir.  Moreover,
      rename gets simpler and more robust with the same "don't bother with
      dget_parent()" approach.
      
      Fixes: 74dd7c97 ("ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()")
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bcf0d9d4
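
      A hedged sketch of the check-after-lock sequence laid out in the list
      above, for the unlink case (error paths trimmed; the lower-mapping
      helpers are the commonly used ecryptfs ones, but treat the exact calls
      as assumptions):

        struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
        struct inode *lower_dir =
                ecryptfs_inode_to_lower(d_inode(dentry->d_parent));
        int rc;

        dget(lower_dentry);     /* only temporary ref needed: keep it positive */
        inode_lock_nested(lower_dir, I_MUTEX_PARENT);
        if (d_unhashed(lower_dentry) ||
            d_inode(lower_dentry->d_parent) != lower_dir) {
                rc = -EINVAL;   /* moved away from the directory we locked */
                goto out;
        }
        rc = vfs_unlink(lower_dir, lower_dentry, NULL);
        out:
        inode_unlock(lower_dir);
        dput(lower_dentry);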
  7. 09 November 2019, 1 commit
  8. 08 November 2019, 1 commit
  9. 07 November 2019, 3 commits
    • SMB3: Fix persistent handles reconnect · d243af7a
      Committed by Pavel Shilovsky
      When the client hits a network reconnect, it re-opens every open
      file with a create context to reconnect a persistent handle. All
      create context types should be 8-byte aligned but the padding
      was missed for that one. As a result, some servers don't allow
      us to reconnect handles and return an error. The problem occurs
      when the problematic context is not at the end of the create
      request packet. Fix this by adding proper padding at the end
      of the reconnect persistent handle context.
      
      Cc: Stable <stable@vger.kernel.org> # 4.19.x
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      d243af7a
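
      A hedged sketch of the kind of change described above: the reconnect
      context gets explicit pad bytes so its total length is a multiple of 8
      and the next create context in the packet stays aligned (structure and
      field names are illustrative, not the exact cifs definitions):

        struct create_durable_handle_reconnect_v2 {
                struct create_context ccontext;
                __u8   Name[8];
                struct durable_reconnect_context_v2 dcontext;
                __u8   Pad[4];  /* rounds the context up to an 8-byte multiple */
        } __packed;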
    • configfs: calculate the depth of parent item · e2f238f7
      Committed by Honggang Li
      When creating a symbolic link, create_link should calculate the depth
      of the parent item. However, both the first and second parameters
      of configfs_get_target_path had been set to the target, so broken
      symbolic links were created.
      
      $ targetcli ls /
      o- / ............................................................. [...]
        o- backstores .................................................. [...]
        | o- block ...................................... [Storage Objects: 0]
        | o- fileio ..................................... [Storage Objects: 2]
        | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
        | | | o- alua ....................................... [ALUA Groups: 1]
        | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
        | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
        | |   o- alua ....................................... [ALUA Groups: 1]
        | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
        | o- pscsi ...................................... [Storage Objects: 0]
        | o- ramdisk .................................... [Storage Objects: 0]
        o- iscsi ................................................ [Targets: 0]
        o- loopback ............................................. [Targets: 0]
        o- srpt ................................................. [Targets: 2]
        | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
        | | o- acls ................................................ [ACLs: 2]
        | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
        | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
        | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
        | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
        | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
        | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
        | | o- luns ................................................ [LUNs: 2]
        | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
        | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
        | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
        |   o- acls ................................................ [ACLs: 0]
        |   o- luns ................................................ [LUNs: 0]
        o- vhost ................................................ [Targets: 0]
      
      Fixes: e9c03af2 ("configfs: calculate the symlink target only once")
      Signed-off-by: Honggang Li <honli@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      e2f238f7
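
      The shape of the fix described above, before and after; the variable
      names are illustrative, the point is that the first argument (the item
      whose depth is computed) must be the symlink's parent, not the target
      again:

        /* broken: depth computed from the target itself */
        configfs_get_target_path(target_item, target_item, path);

        /* fixed: depth computed from the parent of the symlink */
        configfs_get_target_path(parent_item, target_item, path);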
    • ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() · e74540b2
      Committed by Shuning Zhang
      When the extent tree is modified, it should be protected by inode
      cluster lock and ip_alloc_sem.
      
      The extent tree is accessed and modified in
      ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.
      
      The following is one such case: the function ocfs2_fiemap is accessing
      the extent tree while it is being modified at the same time.
      
        kernel BUG at fs/ocfs2/extent_map.c:475!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
        CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
        Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
        task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
        RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        Call Trace:
          ocfs2_fiemap+0x1e3/0x430 [ocfs2]
          do_vfs_ioctl+0x155/0x510
          SyS_ioctl+0x81/0xa0
          system_call_fastpath+0x18/0xd8
        Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
        RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        ---[ end trace c8aa0c8180e869dc ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      This issue can be reproduced every week in a production environment.
      
      This issue is related to the usage pattern.  If others use ocfs2 in this
      way, the kernel will panic frequently.
      
      [akpm@linux-foundation.org: coding style fixes]
      [Fix new warning due to unused function by removing said function - Linus ]
      Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
      Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e74540b2
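
      A hedged sketch of the protection described above; the exact call sites
      inside ocfs2_prepare_inode_for_write differ, but the pattern is to hold
      ip_alloc_sem (shared) across the extent-tree lookup:

        struct ocfs2_inode_info *oi = OCFS2_I(inode);

        down_read(&oi->ip_alloc_sem);
        /* ... extent tree lookups (cluster/extent queries) go here ... */
        up_read(&oi->ip_alloc_sem);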
  10. 05 November 2019, 4 commits
    • ceph: don't allow copy_file_range when stripe_count != 1 · a3a08193
      Committed by Luis Henriques
      copy_file_range tries to use the OSD 'copy-from' operation, which simply
      performs a full object copy.  Unfortunately, the implementation of this
      system call assumes that stripe_count is always set to 1 and doesn't take
      into account that the data may be striped across an object set.  If the
      file layout has stripe_count different from 1, then the destination file
      data will be corrupted.
      
      For example:
      
      Consider an 8 MiB file with a 4 MiB object size, a stripe_count of 2 and
      stripe_size of 2 MiB; the first half of the file will be filled with 'A's
      and the second half will be filled with 'B's:
      
                     0      4M     8M       Obj1     Obj2
                     +------+------+       +----+   +----+
              file:  | AAAA | BBBB |       | AA |   | AA |
                     +------+------+       |----|   |----|
                                           | BB |   | BB |
                                           +----+   +----+
      
      If we copy_file_range this file into a new file (which needs to have the
      same file layout!), then it will start by copying the object starting at
      file offset 0 (Obj1).  And then it will copy the object starting at file
      offset 4M -- which is Obj1 again.
      
      Unfortunately, the solution for this is to not allow remote object
      copies when the file layout stripe_count is not 1, and to simply fall
      back to the default (VFS) copy_file_range implementation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      a3a08193
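
      A hedged sketch of the guard described above (the field names follow the
      ceph file layout, but the exact check and the fallback label are
      assumptions):

        /* The OSD 'copy-from' op copies whole objects and cannot express
         * striping; punt striped layouts to the generic VFS implementation. */
        if (src_ci->i_layout.stripe_count != 1 ||
            dst_ci->i_layout.stripe_count != 1)
                goto fallback;          /* generic_copy_file_range() path */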
    • ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open · 5bb5e6ee
      Committed by Jeff Layton
      If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
      that it already passed d_revalidate so we *know* that it's negative (or
      at least was very recently). Just return -ENOENT in that case.
      
      This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
      call atomic_open with the parent's i_rwsem shared, but calling
      d_splice_alias on a hashed dentry requires the exclusive lock.
      
      If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
      open, and another client races in and creates the file before we
      issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
      the dentry with the new inode without holding sufficient locks.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      5bb5e6ee
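
      A hedged sketch of the early return described above (the surrounding
      ceph_atomic_open flow is simplified):

        if (!(flags & O_CREAT) && !d_in_lookup(dentry)) {
                /* A hashed dentry here just passed d_revalidate, so it is
                 * known negative; don't issue an OPEN that could later force
                 * d_splice_alias() without the exclusive parent lock. */
                return -ENOENT;
        }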
    • btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC · a5009d3a
      Committed by David Sterba
      The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
      deprecated and scheduled for removal, but we actually do use them for
      'btrfs subvolume delete -C/-c'. The only thing that should have been
      deprecated in ebc87351 is the async flag for subvolume creation.
      
      The deprecation was added in this development cycle, so remove it until
      the time is right.
      
      Fixes: ebc87351 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5009d3a
    • btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Committed by Josef Bacik
      We hit a regression while rolling out 5.2 internally where we were
      hitting the following panic
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size; however, a cleanup patch changed us to do
      
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because i_size_read() can evaluate to different
      values in between checking and assigning.  So either an expanding
      truncate or a fallocate could increase our i_size while we're doing
      writeout and actual_end would end up being past the range we have
      locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622, which would do an
        intermediate assignment and thus also avoid the double evaluation,
        but that is not future-proof, should the compiler merge the stores
        and call i_size_read anyway.
      
        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug.  Instead, emulate
        READ_ONCE with two barrier()s, which is what effectively happens.  The
        assembly confirms single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size stored to %rax.
      ]
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      d98da499
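
      The barrier()-based single read described in the note above, sketched as
      a minimal excerpt (the surrounding compress_file_range() code is
      omitted):

        u64 actual_end;
        u64 i_size;

        /*
         * Snapshot i_size exactly once.  The barrier()s keep the compiler
         * from folding the load back into the comparison and re-reading
         * inode->i_size, emulating READ_ONCE() as explained above.
         */
        barrier();
        i_size = i_size_read(inode);
        barrier();
        actual_end = min_t(u64, i_size, end + 1);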
  11. 01 November 2019, 2 commits