1. 15 8月, 2014 9 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
    • F
      Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana 提交于
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
        | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will returns us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      27b9a812
    • T
      Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai 提交于
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      4eb1f66d
    • L
      Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo 提交于
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-By: NMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      ce62003f
    • M
      btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh 提交于
      ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      f90e579c
    • M
      btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh 提交于
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on it's
      exclusive counts.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1152651a
    • F
      Btrfs: read lock extent buffer while walking backrefs · 6f7ff6d7
      Filipe Manana 提交于
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6f7ff6d7
    • J
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik 提交于
      Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out to be not the
      case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call it's process function with no_quota set
      always.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e339a6b0
    • D
      btrfs: adjust statfs calculations according to raid profiles · ba7b6e62
      David Sterba 提交于
      This has been discussed in thread:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
      
      and this patch implements this proposal:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
      
      Works fine for "clean" raid profiles where the raid factor correction
      does the right job. Otherwise it's pessimistic and may show low space
      although there's still some left.
      
      The df nubmers are lightly wrong in case of mixed block groups, but this
      is not a major usecase and can be addressed later.
      
      The RAID56 numbers are wrong almost the same way as before and will be
      addressed separately.
      
      CC: Hugo Mills <hugo@carfax.org.uk>
      CC: cwillu <cwillu@cwillu.com>
      CC: Josef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      ba7b6e62
  2. 01 8月, 2014 2 次提交
  3. 30 7月, 2014 1 次提交
  4. 24 7月, 2014 4 次提交
    • V
      fs: umount on symlink leaks mnt count · 295dc39d
      Vasily Averin 提交于
      Currently umount on symlink blocks following umount:
      
      /vz is separate mount
      
      # ls /vz/ -al | grep test
      drwxr-xr-x.  2 root root       4096 Jul 19 01:14 testdir
      lrwxrwxrwx.  1 root root         11 Jul 19 01:16 testlink -> /vz/testdir
      # umount -l /vz/testlink
      umount: /vz/testlink: not mounted (expected)
      
      # lsof /vz
      # umount /vz
      umount: /vz: device is busy. (unexpected)
      
      In this case mountpoint_last() gets an extra refcount on path->mnt
      Signed-off-by: NVasily Averin <vvs@openvz.org>
      Acked-by: NIan Kent <raven@themaw.net>
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      295dc39d
    • B
      direct-io: fix uninitialized warning in do_direct_IO() · 6fcc5420
      Boaz Harrosh 提交于
      The following warnings:
      
        fs/direct-io.c: In function ‘__blockdev_direct_IO’:
        fs/direct-io.c:1011:12: warning: ‘to’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        fs/direct-io.c:913:16: note: ‘to’ was declared here
        fs/direct-io.c:1011:12: warning: ‘from’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        fs/direct-io.c:913:10: note: ‘from’ was declared here
      
      are false positive because dio_get_page() either fails, or sets both
      'from' and 'to'.
      
      Paul Bolle said ...
      Maybe it's better to move initializing "to" and "from" out of
      dio_get_page(). That _might_ make it easier for both the the reader and
      the compiler to understand what's going on. Something like this:
      
      Christoph Hellwig said ...
      The fix of moving the code definitively looks nicer, while I think
      uninitialized_var is horrible wart that won't get anywhere near my code.
      
      Boaz Harrosh: I agree with Christoph and Paul
      Signed-off-by: NBoaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      6fcc5420
    • H
      simple_xattr: permit 0-size extended attributes · 4e66d445
      Hugh Dickins 提交于
      If a filesystem uses simple_xattr to support user extended attributes,
      LTP setxattr01 and xfstests generic/062 fail with "Cannot allocate
      memory": simple_xattr_alloc()'s wrap-around test mistakenly excludes
      values of zero size.  Fix that off-by-one (but apparently no filesystem
      needs them yet).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e66d445
    • S
      coredump: fix the setting of PF_DUMPCORE · aed8adb7
      Silesh C V 提交于
      Commit 079148b9 ("coredump: factor out the setting of PF_DUMPCORE")
      cleaned up the setting of PF_DUMPCORE by removing it from all the
      linux_binfmt->core_dump() and moving it to zap_threads().But this ended
      up clearing all the previously set flags.  This causes issues during
      core generation when tsk->flags is checked again (eg.  for PF_USED_MATH
      to dump floating point registers).  Fix this.
      Signed-off-by: NSilesh C V <svellattu@mvista.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aed8adb7
  5. 23 7月, 2014 1 次提交
    • K
      NFSD: Fix crash encoding lock reply on 32-bit · f98bac5a
      Kinglong Mee 提交于
      Commit 8c7424cf "nfsd4: don't try to encode conflicting owner if low
      on space" forgot to free conf->data in nfsd4_encode_lockt and before
      sign conf->data to NULL in nfsd4_encode_lock_denied, causing a leak.
      
      Worse, kfree() can be called on an uninitialized pointer in the case of
      a succesful lock (or one that fails for a reason other than a conflict).
      
      (Note that lock->lk_denied.ld_owner.data appears it should be zero here,
      until you notice that it's one arm of a union the other arm of which is
      written to in the succesful case by the
      
      	memcpy(&lock->lk_resp_stateid, &lock_stp->st_stid.sc_stateid,
      	                                sizeof(stateid_t));
      
      in nfsd4_lock().  In the 32-bit case this overwrites ld_owner.data.)
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Fixes: 8c7424cf ""nfsd4: don't try to encode conflicting owner if low on space"
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      f98bac5a
  6. 22 7月, 2014 2 次提交
  7. 20 7月, 2014 2 次提交
    • E
      btrfs: test for valid bdev before kobj removal in btrfs_rm_device · 0bfaa9c5
      Eric Sandeen 提交于
      commit 99994cde btrfs: dev delete should remove sysfs entry
      added a btrfs_kobj_rm_device, which dereferences device->bdev...
      right after we check whether device->bdev might be NULL.
      
      I don't honestly know if it's possible to have a NULL device->bdev
      here, but assuming that it is (given the test), we need to move
      the kobject removal to be under that test.
      
      (Coverity spotted this)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0bfaa9c5
    • L
      Btrfs: fix abnormal long waiting in fsync · 98ce2ded
      Liu Bo 提交于
      xfstests generic/127 detected this problem.
      
      With commit 7fc34a62, now fsync will only flush
      data within the passed range.  This is the cause of the above problem,
      -- btrfs's fsync has a stage called 'sync log' which will wait for all the
      ordered extents it've recorded to finish.
      
      In xfstests/generic/127, with mixed operations such as truncate, fallocate,
      punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will
      mmap, and then msync.  And I find that msync will wait for quite a long time
      (about 20s in my case), thanks to ftrace, it turns out that the previous
      fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the
      range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants,
      there can be some ordered extents created but not getting corresponding pages
      flushed, then they're left in memory until we fsync which runs into the
      stage 'sync log', and fsync will just wait for the system writeback thread
      to flush those pages and get ordered extents finished, so the latency is
      inevitable.
      
      This adds a flush similar to btrfs_start_ordered_extent() in
      btrfs_wait_logged_extents() to fix that.
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      98ce2ded
  8. 18 7月, 2014 8 次提交
  9. 16 7月, 2014 1 次提交
  10. 15 7月, 2014 4 次提交
    • D
      xfs: null unused quota inodes when quota is on · 03e01349
      Dave Chinner 提交于
      When quota is on, it is expected that unused quota inodes have a
      value of NULLFSINO. The changes to support a separate project quota
      in 3.12 broken this rule for non-project quota inode enabled
      filesystem, as the code now refuses to write the group quota inode
      if neither group or project quotas are enabled. This regression was
      introduced by commit d892d586 ("xfs: Start using pquotaino from the
      superblock").
      
      In this case, we should be writing NULLFSINO rather than nothing to
      ensure that we leave the group quota inode in a valid state while
      quotas are enabled.
      
      Failure to do so doesn't cause a current kernel to break - the
      separate project quota inodes introduced translation code to always
      treat a zero inode as NULLFSINO. This was introduced by commit
      01026297 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
      also in 3.12 but older kernels do not do this and hence taking a
      filesystem back to an older kernel can result in quotas failing
      initialisation at mount time. When that happens, we see this in
      dmesg:
      
      [ 1649.215390] XFS (sdb): Mounting Filesystem
      [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
      [ 1649.316902] XFS (sdb): Ending clean mount
      
      By ensuring that we write NULLFSINO to quota inodes that aren't
      active, we avoid this problem. We have to be really careful when
      determining if the quota inodes are active or not, because we don't
      want to write a NULLFSINO if the quota inodes are active and we
      simply aren't updating them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      03e01349
    • D
      xfs: refine the allocation stack switch · cf11da9c
      Dave Chinner 提交于
      The allocation stack switch at xfs_bmapi_allocate() has served it's
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because we still have the issue of older kernels
      having smaller stacks and that they are still supported and are
      demonstrating a wide range of different stack overflows.  Red Hat
      has several open bugs for allocation based stack overflows from
      directory modifications and direct IO block allocation and these
      problems still need to be solved. If we can solve them upstream,
      then distro's won't need to bake their own unique solutions.
      
      To that end, I've observed that every allocation based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performacne overhead caused by switching stacks. The worse case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously be
      able to generate - fs_mark testing has been able to generate stack
      usage of around 7k without too much trouble; with this patch it's
      only just getting to 5.5k. This is primarily because the metadata
      allocation paths (e.g. directory blocks) are no longer causing
      double splits on the same stack, and hence now stack tracing is
      showing swapping being the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cf11da9c
    • D
      Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Dave Chinner 提交于
      This reverts commit 1f6d6482.
      
      This commit resulted in regressions in performance in low
      memory situations where kswapd was doing writeback of delayed
      allocation blocks. It resulted in significant parallelism of the
      kswapd work and with the special kswapd flags meant that hundreds of
      active allocation could dip into kswapd specific memory reserves and
      avoid being throttled. This cause a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      aa182e64
    • B
      aio: protect reqs_available updates from changes in interrupt handlers · 263782c1
      Benjamin LaHaise 提交于
      As of commit f8567a38 it is now possible to
      have put_reqs_available() called from irq context.  While put_reqs_available()
      is per cpu, it did not protect itself from interrupts on the same CPU.  This
      lead to aio_complete() corrupting the available io requests count when run
      under a heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
      disabling irq updates around the per cpu batch updates of reqs_available.
      
      Many thanks to Robert and folks for testing and tracking this down.
      Reported-by: NRobert Elliot <Elliott@hp.com>
      Tested-by: NRobert Elliot <Elliott@hp.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kenel.org
      263782c1
  11. 14 7月, 2014 3 次提交
  12. 13 7月, 2014 3 次提交