1. 06 12月, 2016 1 次提交
  2. 30 11月, 2016 1 次提交
    • J
      btrfs: increment ctx->pos for every emitted or skipped dirent in readdir · d2fbb2b5
      Jeff Mahoney 提交于
      If we process the last item in the leaf and hit an I/O error while
      reading the next leaf, we return -EIO without having adjusted the
      position.  Since we have emitted dirents, getdents() will return
      the byte count to the user instead of the error.  Subsequent callers
      will emit the last successful dirent again, and return -EIO again,
      with the same result.  Callers loop forever.
      
      Instead, if we always increment ctx->pos after emitting or skipping
      the dirent, we'll be sure that we won't hit the same one again.  When
      we go to process the next leaf, we won't have emitted any dirents
      and the -EIO will be returned to the user properly.  We also don't
      need to track if we've emitted a dirent already or if we've changed
      the position yet.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d2fbb2b5
  3. 25 6月, 2016 1 次提交
    • O
      Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval 提交于
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      02dbfc99
  4. 11 2月, 2016 1 次提交
    • D
      btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      David Sterba 提交于
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
      larger value, then it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get to that situation like that:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
      print shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bc4ef759
  5. 29 1月, 2014 2 次提交
  6. 29 6月, 2013 1 次提交
  7. 07 3月, 2013 1 次提交
    • C
      Btrfs: improve the delayed inode throttling · de3cb945
      Chris Mason 提交于
      The delayed inode code batches up changes to the btree in hopes of doing
      them in bulk.  As the changes build up, processes kick off worker
      threads and wait for them to make progress.
      
      The current code kicks off an async work queue item for each delayed
      node, which creates a lot of churn.  It also uses a fixed 1 HZ waiting
      period for the throttle, which allows us to build a lot of pending
      work and can slow down the commit.
      
      This changes us to watch a sequence counter as it is bumped during the
      operations.  We kick off fewer work items and have each work item do
      more work.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      de3cb945
  8. 20 2月, 2013 1 次提交
  9. 24 7月, 2012 1 次提交
    • J
      Btrfs: flush delayed inodes if we're short on space · 96c3f433
      Josef Bacik 提交于
      Those crazy gentoo guys have been complaining about ENOSPC errors on their
      portage volumes.  This is because doing things like untar tends to create
      lots of new files which will soak up all the reservation space in the
      delayed inodes.  Usually this gets papered over by the fact that we will try
      and commit the transaction, however if this happens in the wrong spot or we
      choose not to commit the transaction you will be screwed.  So add the
      ability to expclitly flush delayed inodes to free up space.  Please test
      this out guys to make sure it works since as usual I cannot reproduce.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      96c3f433
  10. 15 6月, 2012 1 次提交
  11. 27 7月, 2011 1 次提交
  12. 27 6月, 2011 1 次提交
    • M
      btrfs: fix inconsonant inode information · 2f7e33d4
      Miao Xie 提交于
      When iputting the inode, We may leave the delayed nodes if they have some
      delayed items that have not been dealt with. So when the inode is read again,
      we must look up the relative delayed node, and use the information in it to
      initialize the inode. Or we will get inconsonant inode information, it may
      cause that the same directory index number is allocated again, and hit the
      following oops:
      
      [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
      insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
      [ 5447.569766] ------------[ cut here ]------------
      [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
      [SNIP]
      [ 5447.790721] Call Trace:
      [ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
      [ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
      [ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
      [ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
      [ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
      [ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
      [ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
      [ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
      [ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
      [ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
      [ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
      [ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
      [ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
      [ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
      [ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
      [ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
      [ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
      [ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
      
      Fix it by reusing the old delayed node.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2f7e33d4
  13. 18 6月, 2011 2 次提交
    • C
      Btrfs: avoid delayed metadata items during commits · e999376f
      Chris Mason 提交于
      Snapshot creation has two phases.  One is the initial snapshot setup,
      and the second is done during commit, while nobody is allowed to modify
      the root we are snapshotting.
      
      The delayed metadata insertion code can break that rule, it does a
      delayed inode update on the inode of the parent of the snapshot,
      and delayed directory item insertion.
      
      This makes sure to run the pending delayed operations before we
      record the snapshot root, which avoids corruptions.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e999376f
    • M
      btrfs: fix wrong reservation when doing delayed inode operations · 19fd2949
      Miao Xie 提交于
      We have migrated the space for the delayed inode items from
      trans_block_rsv to global_block_rsv, but we forgot to set trans->block_rsv to
      global_block_rsv when we doing delayed inode operations, and the following Oops
      happened:
      
      [ 9792.654889] ------------[ cut here ]------------
      [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
      btrfs_alloc_free_block+0xca/0x27c [btrfs]()
      [ 9792.654899] Hardware name: To Be Filled By O.E.M.
      [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
      ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
      arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
      snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
      pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
      i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
      mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
      ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
      drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
      [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
      [ 9792.654920] Call Trace:
      [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
      [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
      [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c [btrfs]
      [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
      [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58 [btrfs]
      [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368 [btrfs]
      [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
      [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
      [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
      [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
      [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
      [ 9792.655039]  [<ffffffffa03d41b1>] btrfs_update_delayed_inode+0x72/0x137 [btrfs]
      [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb [btrfs]
      [ 9792.655062]  [<ffffffffa039a69b>] btrfs_commit_transaction+0x228/0x654 [btrfs]
      [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
      [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
      [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
      [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
      [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
      [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
      [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
      [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
      [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
      [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
      [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
      
      This patch fix it by setting the reservation of the transaction handle to the
      correct one.
      Reported-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      19fd2949
  14. 21 5月, 2011 1 次提交
    • M
      btrfs: implement delayed inode items operation · 16cdcec7
      Miao Xie 提交于
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which is spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix deadlock between readdir() and memory fault, which is reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
        inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task which does delayed items
        balance, which is reported by Tsutomu Itoh.
      - Modify the patch address David Sterba's comment.
      - Fix the bug of the cpu recursion spinlock, reported by Chris Mason
      
      Changelog V1 -> V2:
      - break up the global rb-tree, use a list to manage the delayed nodes,
        which is created for every directory and file, and used to manage the
        delayed directory name index items and the delayed inode item.
      - introduce a worker to deal with the delayed nodes.
      
      Compare with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
      such as inode item, directory name item, directory name index and so on.
      
      If we can do some delayed b+ tree insertion or deletion, we can improve the
      performance, so we made this patch which implemented delayed directory name
      index insertion/deletion and delayed inode update.
      
      Implementation:
      - introduce a delayed root object into the filesystem, that use two lists to
        manage the delayed nodes which are created for every file/directory.
        One is used to manage all the delayed nodes that have delayed items. And the
        other is used to manage the delayed nodes which is waiting to be dealt with
        by the work thread.
      - Every delayed node has two rb-tree, one is used to manage the directory name
        index which is going to be inserted into b+ tree, and the other is used to
        manage the directory name index which is going to be deleted from b+ tree.
      - introduce a worker to deal with the delayed operation. This worker is used
        to deal with the works of the delayed directory name index items insertion
        and deletion and the delayed inode update.
        When the delayed items is beyond the lower limit, we create works for some
        delayed nodes and insert them into the work queue of the worker, and then
        go back.
        When the delayed items is beyond the upper bound, we create works for all
        the delayed nodes that haven't been dealt with, and insert them into the work
        queue of the worker, and then wait for that the untreated items is below some
        threshold value.
      - When we want to insert a directory name index into b+ tree, we just add the
        information into the delayed inserting rb-tree.
        And then we check the number of the delayed items and do delayed items
        balance. (The balance policy is above.)
      - When we want to delete a directory name index from the b+ tree, we search it
        in the inserting rb-tree at first. If we look it up, just drop it. If not,
        add the key of it into the delayed deleting rb-tree.
        Similar to the delayed inserting rb-tree, we also check the number of the
        delayed items and do delayed items balance.
        (The same to inserting manipulation)
      - When we want to update the metadata of some inode, we cached the data of the
        inode into the delayed node. the worker will flush it into the b+ tree after
        dealing with the delayed insertion and deletion.
      - We will move the delayed node to the tail of the list after we access the
        delayed node, By this way, we can cache more delayed items and merge more
        inode updates.
      - If we want to commit transaction, we will deal with all the delayed node.
      - the delayed node will be freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
      
      I did a quick test by the benchmark tool[1] and found we can improve the
      performance of file creation by ~15%, and file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dave@jikos.cz>
      Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      16cdcec7