1. 18 9月, 2014 6 次提交
  2. 15 9月, 2014 3 次提交
    • L
      vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds 提交于
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't, Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hickup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9226b5b4
    • A
      be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9
      Al Viro 提交于
      in the former we simply check if dentry is still valid after picking
      its ->d_inode; in the latter we fetch ->d_inode in the same places
      where we fetch dentry and its ->d_seq, under the same checks.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4023bfc9
    • A
      don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377
      Al Viro 提交于
      return the value instead, and have path_init() do the assignment.  Broken by
      "vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
      which was Cc-stable with 2.6.38+ as destination.  This one should go where
      it went.
      
      To avoid dummy value returned in case when root is already set (it would do
      no harm, actually, since the only caller that doesn't ignore the return value
      is guaranteed to have nd->root *not* set, but it's more obvious that way),
      lift the check into callers.  And do the same to set_root(), to keep them
      in sync.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7bd88377
  3. 14 9月, 2014 3 次提交
    • A
      fix bogus read_seqretry() checks introduced in b37199e6 · f5be3e29
      Al Viro 提交于
      read_seqretry() returns true on mismatch, not on match...
      
      Cc: stable@vger.kernel.org # 3.15+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f5be3e29
    • A
      move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
      Al Viro 提交于
      and lock the right list there
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6f18493e
    • L
      vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds 提交于
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: NJosef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99d263d4
  4. 11 9月, 2014 4 次提交
  5. 09 9月, 2014 7 次提交
    • J
      nfs: revert "nfs4: queue free_lock_state job submission to nfsiod" · 0c0e0d3c
      Jeff Layton 提交于
      This reverts commit 49a4bda2.
      
      Christoph reported an oops due to the above commit:
      
      generic/089 242s ...[ 2187.041239] general protection fault: 0000 [#1]
      SMP
      [ 2187.042899] Modules linked in:
      [ 2187.044000] CPU: 0 PID: 11913 Comm: kworker/0:1 Not tainted 3.16.0-rc6+ #1151
      [ 2187.044287] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [ 2187.044287] Workqueue: nfsiod free_lock_state_work
      [ 2187.044287] task: ffff880072b50cd0 ti: ffff88007a4ec000 task.ti: ffff88007a4ec000
      [ 2187.044287] RIP: 0010:[<ffffffff81361ca6>]  [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30
      [ 2187.044287] RSP: 0018:ffff88007a4efd58  EFLAGS: 00010296
      [ 2187.044287] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007a947ac0 RCX: 8000000000000000
      [ 2187.044287] RDX: ffffffff826af9e0 RSI: ffff88007b093c00 RDI: ffff88007b093db8
      [ 2187.044287] RBP: ffff88007a4efd58 R08: ffffffff832d3e10 R09: 000001c40efc0000
      [ 2187.044287] R10: 0000000000000000 R11: 0000000000059e30 R12: ffff88007fc13240
      [ 2187.044287] R13: ffff88007fc18b00 R14: ffff88007b093db8 R15: 0000000000000000
      [ 2187.044287] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [ 2187.044287] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2187.044287] CR2: 00007f93ec33fb80 CR3: 0000000079dc2000 CR4: 00000000000006f0
      [ 2187.044287] Stack:
      [ 2187.044287]  ffff88007a4efdd8 ffffffff810cc877 ffffffff810cc80d ffff88007fc13258
      [ 2187.044287]  000000007a947af0 0000000000000000 ffffffff8353ccc8 ffffffff82b6f3d0
      [ 2187.044287]  0000000000000000 ffffffff82267679 ffff88007a4efdd8 ffff88007fc13240
      [ 2187.044287] Call Trace:
      [ 2187.044287]  [<ffffffff810cc877>] process_one_work+0x1c7/0x490
      [ 2187.044287]  [<ffffffff810cc80d>] ? process_one_work+0x15d/0x490
      [ 2187.044287]  [<ffffffff810cd569>] worker_thread+0x119/0x4f0
      [ 2187.044287]  [<ffffffff810fbbad>] ? trace_hardirqs_on+0xd/0x10
      [ 2187.044287]  [<ffffffff810cd450>] ? init_pwq+0x190/0x190
      [ 2187.044287]  [<ffffffff810d3c6f>] kthread+0xdf/0x100
      [ 2187.044287]  [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70
      [ 2187.044287]  [<ffffffff81d9873c>] ret_from_fork+0x7c/0xb0
      [ 2187.044287]  [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70
      [ 2187.044287] Code: 0f 1f 44 00 00 31 c0 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d b7 48 fe ff ff 48 8b 87 58 fe ff ff 48 89 e5 48 8b 40 30 <48> 8b 00 48 8b 10 48 89 c7 48 8b 92 90 03 00 00 ff 52 28 5d c3
      [ 2187.044287] RIP  [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30
      [ 2187.044287]  RSP <ffff88007a4efd58>
      [ 2187.103626] ---[ end trace 0f11326d28e5d8fa ]---
      
      The original reason for this patch was because the fl_release_private
      operation couldn't sleep. With commit ed9814d8 (locks: defer freeing
      locks in locks_delete_lock until after i_lock has been dropped), this is
      no longer a problem so we can revert this patch.
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NJeff Layton <jlayton@primarydata.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      0c0e0d3c
    • C
      nfs: fix kernel warning when removing proc entry · 21e81002
      Cong Wang 提交于
      I saw the following kernel warning:
      
      [ 1852.321222] ------------[ cut here ]------------
      [ 1852.326527] WARNING: CPU: 0 PID: 118 at fs/proc/generic.c:521 remove_proc_entry+0x154/0x16b()
      [ 1852.335630] remove_proc_entry: removing non-empty directory 'fs/nfsfs', leaking at least 'volumes'
      [ 1852.344084] CPU: 0 PID: 118 Comm: kworker/u8:2 Not tainted 3.16.0+ #540
      [ 1852.350036] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 1852.354992] Workqueue: netns cleanup_net
      [ 1852.358701]  0000000000000000 ffff880116f2fbd0 ffffffff819c03e9 ffff880116f2fc18
      [ 1852.366474]  ffff880116f2fc08 ffffffff810744ee ffffffff811e0e6e ffff8800d4e96238
      [ 1852.373507]  ffffffff81dbe665 ffff8800d46a5948 0000000000000005 ffff880116f2fc68
      [ 1852.380224] Call Trace:
      [ 1852.381976]  [<ffffffff819c03e9>] dump_stack+0x4d/0x66
      [ 1852.385495]  [<ffffffff810744ee>] warn_slowpath_common+0x7a/0x93
      [ 1852.389869]  [<ffffffff811e0e6e>] ? remove_proc_entry+0x154/0x16b
      [ 1852.393987]  [<ffffffff8107457b>] warn_slowpath_fmt+0x4c/0x4e
      [ 1852.397999]  [<ffffffff811e0e6e>] remove_proc_entry+0x154/0x16b
      [ 1852.402034]  [<ffffffff8129c73d>] nfs_fs_proc_net_exit+0x53/0x56
      [ 1852.406136]  [<ffffffff812a103b>] nfs_net_exit+0x12/0x1d
      [ 1852.409774]  [<ffffffff81785bc9>] ops_exit_list+0x44/0x55
      [ 1852.413529]  [<ffffffff81786389>] cleanup_net+0xee/0x182
      [ 1852.417198]  [<ffffffff81088c9e>] process_one_work+0x209/0x40d
      [ 1852.502320]  [<ffffffff81088bf7>] ? process_one_work+0x162/0x40d
      [ 1852.587629]  [<ffffffff810890c1>] worker_thread+0x1f0/0x2c7
      [ 1852.673291]  [<ffffffff81088ed1>] ? process_scheduled_works+0x2f/0x2f
      [ 1852.759470]  [<ffffffff8108e079>] kthread+0xc9/0xd1
      [ 1852.843099]  [<ffffffff8109427f>] ? finish_task_switch+0x3a/0xce
      [ 1852.926518]  [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61
      [ 1853.008565]  [<ffffffff819cbeac>] ret_from_fork+0x7c/0xb0
      [ 1853.076477]  [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61
      [ 1853.140653] ---[ end trace 69c4c6617f78e32d ]---
      
      It looks wrong that we add "/proc/net/nfsfs" in nfs_fs_proc_net_init()
      while remove "/proc/fs/nfsfs" in nfs_fs_proc_net_exit().
      
      Fixes: commit 65b38851 (NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes)
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Dan Aloni <dan@kernelim.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      [Trond: replace uses of remove_proc_entry() with remove_proc_subtree()
      as suggested by Al Viro]
      Cc: stable@vger.kernel.org # 3.4.x : 65b38851: NFS: Fix /proc/fs/nfsfs/servers
      Cc: stable@vger.kernel.org # 3.4.x
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      21e81002
    • C
      Btrfs: use insert_inode_locked4 for inode creation · b0d5d10f
      Chris Mason 提交于
      Btrfs was inserting inodes into the hash table before we had fully
      set the inode up on disk.  This leaves us open to rare races that allow
      two different inodes in memory for the same [root, inode] pair.
      
      This patch fixes things by using insert_inode_locked4 to insert an I_NEW
      inode and unlock_new_inode when we're ready for the rest of the kernel
      to use the inode.
      
      It also makes sure to init the operations pointers on the inode before
      going into the error handling paths.
      Signed-off-by: NChris Mason <clm@fb.com>
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0d5d10f
    • F
      Btrfs: fix fsync data loss after a ranged fsync · 49dae1bc
      Filipe Manana 提交于
      While we're doing a full fsync (when the inode has the flag
      BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
      portion of the file), we might have ordered operations that are started
      before or while we're logging the inode and that fall outside the fsync
      range.
      
      Therefore when a full ranged fsync finishes don't remove every extent
      map from the list of modified extent maps - as for some of them, that
      fall outside our fsync range, their respective ordered operation hasn't
      finished yet, meaning the corresponding file extent item wasn't inserted
      into the fs/subvol tree yet and therefore we didn't log it, and we must
      let the next fast fsync (one that checks only the modified list) see this
      extent map and log a matching file extent item to the log btree and wait
      for its ordered operation to finish (if it's still ongoing).
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      49dae1bc
    • D
      Btrfs: kfree()ing ERR_PTRs · c47ca32d
      Dan Carpenter 提交于
      The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
      btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.
      
      These kind of bugs are "One Err Bugs" where there is just one error
      label that does everything.  I could set the "inherit = NULL" and keep
      the single out label but it ends up being more complicated that way.  It
      makes the code simpler to re-order the unwind so it's in the mirror
      order of the allocation and introduce some new error labels.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c47ca32d
    • J
      lockd: fix rpcbind crash on lockd startup failure · 7c17705e
      J. Bruce Fields 提交于
      Nikita Yuschenko reported that booting a kernel with init=/bin/sh and
      then nfs mounting without portmap or rpcbind running using a busybox
      mount resulted in:
      
        # mount -t nfs 10.30.130.21:/opt /mnt
        svc: failed to register lockdv1 RPC service (errno 111).
        lockd_up: makesock failed, error=-111
        Unable to handle kernel paging request for data at address 0x00000030
        Faulting instruction address: 0xc055e65c
        Oops: Kernel access of bad area, sig: 11 [#1]
        MPC85xx CDS
        Modules linked in:
        CPU: 0 PID: 1338 Comm: mount Not tainted 3.10.44.cge #117
        task: cf29cea0 ti: cf35c000 task.ti: cf35c000
        NIP: c055e65c LR: c0566490 CTR: c055e648
        REGS: cf35dad0 TRAP: 0300   Not tainted  (3.10.44.cge)
        MSR: 00029000 <CE,EE,ME>  CR: 22442488  XER: 20000000
        DEAR: 00000030, ESR: 00000000
      
        GPR00: c05606f4 cf35db80 cf29cea0 cf0ded80 cf0dedb8 00000001 1dec3086
        00000000
        GPR08: 00000000 c07b1640 00000007 1dec3086 22442482 100b9758 00000000
        10090ae8
        GPR16: 00000000 000186a5 00000000 00000000 100c3018 bfa46edc 100b0000
        bfa46ef0
        GPR24: cf386ae0 c07834f0 00000000 c0565f88 00000001 cf0dedb8 00000000
        cf0ded80
        NIP [c055e65c] call_start+0x14/0x34
        LR [c0566490] __rpc_execute+0x70/0x250
        Call Trace:
        [cf35db80] [00000080] 0x80 (unreliable)
        [cf35dbb0] [c05606f4] rpc_run_task+0x9c/0xc4
        [cf35dbc0] [c0560840] rpc_call_sync+0x50/0xb8
        [cf35dbf0] [c056ee90] rpcb_register_call+0x54/0x84
        [cf35dc10] [c056f24c] rpcb_register+0xf8/0x10c
        [cf35dc70] [c0569e18] svc_unregister.isra.23+0x100/0x108
        [cf35dc90] [c0569e38] svc_rpcb_cleanup+0x18/0x30
        [cf35dca0] [c0198c5c] lockd_up+0x1dc/0x2e0
        [cf35dcd0] [c0195348] nlmclnt_init+0x2c/0xc8
        [cf35dcf0] [c015bb5c] nfs_start_lockd+0x98/0xec
        [cf35dd20] [c015ce6c] nfs_create_server+0x1e8/0x3f4
        [cf35dd90] [c0171590] nfs3_create_server+0x10/0x44
        [cf35dda0] [c016528c] nfs_try_mount+0x158/0x1e4
        [cf35de20] [c01670d0] nfs_fs_mount+0x434/0x8c8
        [cf35de70] [c00cd3bc] mount_fs+0x20/0xbc
        [cf35de90] [c00e4f88] vfs_kern_mount+0x50/0x104
        [cf35dec0] [c00e6e0c] do_mount+0x1d0/0x8e0
        [cf35df10] [c00e75ac] SyS_mount+0x90/0xd0
        [cf35df40] [c000ccf4] ret_from_syscall+0x0/0x3c
      
      The addition of svc_shutdown_net() resulted in two calls to
      svc_rpcb_cleanup(); the second is no longer necessary and crashes when
      it calls rpcb_register_call with clnt=NULL.
      Reported-by: NNikita Yushchenko <nyushchenko@dev.rtsoft.ru>
      Fixes: 679b033d "lockd: ensure we tear down any live sockets when socket creation fails during lockd_up"
      Cc: stable@vger.kernel.org
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      7c17705e
    • J
      nfsd4: fix rd_dircount enforcement · aee37764
      J. Bruce Fields 提交于
      Commit 3b299709 "nfsd4: enforce rd_dircount" totally misunderstood
      rd_dircount; it refers to total non-attribute bytes returned, not number
      of directory entries returned.
      
      Bring the code into agreement with RFC 3530 section 14.2.24.
      
      Cc: stable@vger.kernel.org
      Fixes: 3b299709 "nfsd4: enforce rd_dircount"
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      aee37764
  6. 08 9月, 2014 1 次提交
  7. 05 9月, 2014 8 次提交
  8. 04 9月, 2014 3 次提交
  9. 03 9月, 2014 4 次提交
    • T
      ext4: avoid trying to kfree an ERR_PTR pointer · a9cfcd63
      Theodore Ts'o 提交于
      Thanks to Dan Carpenter for extending smatch to find bugs like this.
      (This was found using a development version of smatch.)
      
      Fixes: 36de9286
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      a9cfcd63
    • F
      Btrfs: fix crash while doing a ranged fsync · dac5705c
      Filipe Manana 提交于
      While doing a ranged fsync, that is, one whose range doesn't cover the
      whole possible file range (0 to LLONG_MAX), we can crash under certain
      circumstances with a trace like the following:
      
      [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      (...)
      [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
      (...)
      [41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
      (...)
      [41074.644919] Stack:
      (...)
      [41074.644919] Call Trace:
      [41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
      [41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
      [41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
      [41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
      [41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
      [41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
      [41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
      [41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
      [41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
      [41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
      [41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
      (...)
      
      The necessary conditions that lead to such crash are:
      
      * an incremental fsync (when the inode doesn't have the
        BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
        a file extent item ending at offset X;
      
      * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
        to a file truncate operation that reduces the file to a size smaller
        than X;
      
      * a ranged fsync call happens (via an msync for example), with a range that
        doesn't cover the whole file and the end of this range, lets call it Y, is
        smaller than X;
      
      * btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
        calls btrfs_truncate_inode_items() to remove all items from the log
        tree that are associated with our file;
      
      * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
        file extent item it removed is the one ending at offset X, where X > 0 and
        X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
        parameter set to X;
      
      * btrfs_ordered_update_i_size() sees that X is greater then the current ordered
        size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
        ordered operation with a range covering the offset X, calling a BUG_ON() if
        such ordered operation exists. This assumption is made because the disk_i_size
        is only increased after the corresponding file extent item is added to the
        btree (btrfs_finish_ordered_io);
      
      * But because our fsync covers only a limited range, such an ordered extent might
        exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
        extent to finish when calling btrfs_wait_ordered_range();
      
      And then by the time btrfs_ordered_update_i_size() is called, via:
      
         btrfs_sync_file() ->
             btrfs_log_dentry_safe() ->
                 btrfs_log_inode_parent() ->
                     btrfs_log_inode() ->
                         btrfs_truncate_inode_items() ->
                             btrfs_ordered_update_i_size()
      
      We hit the BUG_ON(), which could never happen if the fsync range covered the whole
      possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
      finish before calling btrfs_truncate_inode_items().
      
      So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
      from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
      
      Issue found while running xfstests/generic/127 (happens very rarely for me), more
      specifically via the fsx calls that use memory mapped IO (and issue msync calls).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dac5705c
    • F
      Btrfs: fix corruption after write/fsync failure + fsync + log recovery · d9f85963
      Filipe Manana 提交于
      While writing to a file, in inode.c:cow_file_range() (and same applies to
      submit_compressed_extents()), after reserving an extent for the file data,
      we create a new extent map for the written range and insert it into the
      extent map cache. After that, we create an ordered operation, but if it
      fails (due to a transient/temporary-ENOMEM), we return without dropping
      that extent map, which points to a reserved extent that is freed when we
      return. A subsequent incremental fsync (when the btrfs inode doesn't have
      the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
      logs a file extent item based on that extent map, which points to a disk
      extent that doesn't contain valid data - it was freed by us earlier, at this
      point it might contain any random/garbage data.
      
      Therefore, if we reach an error condition when cowing a file range after
      we added the new extent map to the cache, drop it from the cache before
      returning.
      
      Some sequence of steps that lead to this:
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o commit=9999 /dev/sdd /mnt
          $ cd /mnt
      
          $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
          $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
          $ sync
      
          $ od -t x1 foo
          0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
          *
          0020000
      
          $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
      
          # Now this write + fsync fail with -ENOMEM, which was returned by
          # btrfs_add_ordered_extent() in inode.c:cow_file_range().
          $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
          $ xfs_io -c "fsync" foo
          fsync: Cannot allocate memory
      
          # Now do a new write + fsync, which will succeed. Our previous
          # -ENOMEM was a transient/temporary error.
          $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
          $ xfs_io -c "fsync" foo
      
          # Our file content (in page cache) is now:
          $ od -t x1 foo
          0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
          *
          0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # Now reboot the machine, and mount the fs, so that fsync log replay
          # takes place.
      
          # The file content is now weird, in particular the first 8Kb, which
          # do not match our data before nor after the sync command above.
          $ od -t x1 foo
          0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # In fact these first 4Kb are a duplicate of the last 4kb block.
          # The last write got an extent map/file extent item that points to
          # the same disk extent that we got in the write+fsync that failed
          # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
          # verify that:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      		extent data disk byte 12582912 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
      		extent data disk byte 0 nr 0
      		extent data offset 0 nr 8192 ram 8192
      	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
      		extent data disk byte 12582912 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      
          $ umount /dev/sdd
          $ btrfsck /dev/sdd
          Checking filesystem on /dev/sdd
          UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
          checking extents
          extent item 12582912 has multiple extent items
          ref mismatch on [12582912 4096] extent item 1, found 2
          Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
          backpointer mismatch on [12582912 4096]
          Errors found in extent allocation tree or chunk allocation
          checking free space cache
          checking fs roots
          root 5 inode 257 errors 1000, some csum missing
          found 131074 bytes used err is 1
          total csum bytes: 4
          total tree bytes: 131072
          total fs tree bytes: 32768
          total extent tree bytes: 16384
          btree space waste bytes: 123404
          file data blocks allocated: 274432
           referenced 274432
          Btrfs v3.14.1-96-gcc7fd5a-dirty
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d9f85963
    • J
      aio: add missing smp_rmb() in read_events_ring · 2ff396be
      Jeff Moyer 提交于
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      2ff396be
  10. 02 9月, 2014 1 次提交
    • C
      f2fs: reposition unlock_new_inode to prevent accessing invalid inode · b73e5282
      Chao Yu 提交于
      As the race condition on the inode cache, following scenario can appear:
      [Thread a]				[Thread b]
      					->f2fs_mkdir
      					  ->f2fs_add_link
      					    ->__f2fs_add_link
      					      ->init_inode_metadata failed here
      ->gc_thread_func
        ->f2fs_gc
          ->do_garbage_collect
            ->gc_data_segment
              ->f2fs_iget
                ->iget_locked
                  ->wait_on_inode
      					  ->unlock_new_inode
              ->move_data_page
      					  ->make_bad_inode
      					  ->iput
      
      When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
      should be set as bad to avoid being accessed by other thread. But in above
      scenario, it allows f2fs to access the invalid inode before this inode was set
      as bad.
      This patch fix the potential problem, and this issue was found by code review.
      
      change log from v1:
       o Add condition judgment in gc_data_segment() suggested by Changman Lee.
       o use iget_failed to simplify code.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      b73e5282